Abstract

Consider the standard linear regression model $Y = X\beta^* + w$, where $Y \in \mathbb{R}^n$ is an observation
vector, $X \in \mathbb{R}^{n \times d}$ is a design matrix, $\beta^* \in \mathbb{R}^d$ is the unknown regression vector, and
$w \sim N(0, \sigma^2 I)$ is additive Gaussian noise. This paper studies the minimax rates of convergence for
estimation of $\beta^*$ in $\ell_p$-losses and in the $\ell_2$-prediction loss, assuming that $\beta^*$ belongs to an
$\ell_q$-ball $\mathbb{B}_q(R_q)$ for some $q \in [0, 1]$. We show that under suitable regularity conditions on the design
matrix $X$, the minimax error in $\ell_2$-loss and $\ell_2$-prediction loss scales as $R_q \big(\frac{\log d}{n}\big)^{1 - q/2}$.
In addition, we provide lower bounds on minimax risks in $\ell_p$-norms, for all $p \in [1, +\infty]$, $p \neq q$.
Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results
on the metric entropy of the balls $\mathbb{B}_q(R_q)$, whereas our proofs of the upper bounds are direct
and constructive, involving direct analysis of least-squares over $\ell_q$-balls. For the special case
$q = 0$, a comparison with $\ell_2$-risks achieved by computationally efficient $\ell_1$-relaxations reveals
that although such methods can achieve the minimax rates up to constant factors, they require
slightly stronger assumptions on the design matrix $X$ than algorithms involving least-squares over
the $\ell_0$-ball.



1 Introduction 




The area of high-dimensional statistical inference concerns estimation in the "large d, small n"
regime, where $d$ refers to the ambient dimension of the problem and $n$ refers to the sample size. Such
high-dimensional inference problems arise in various areas of science, including astrophysics, remote
sensing and geophysics, and computational biology, among others. In the absence of additional
structure, it is frequently impossible to obtain consistent estimators unless the ratio $d/n$ converges to zero.
However, many applications require solving inference problems with $d \ge n$, so that consistency is
not possible without imposing additional structure. Accordingly, an active line of research in
high-dimensional inference is based on imposing various types of structural conditions, such as sparsity,
manifold structure, or graphical model structure, and then studying the performance of different
estimators. For instance, in the case of models with some type of sparsity constraint, a great deal of
work has studied the behavior of $\ell_1$-based relaxations.

Complementary to the understanding of computationally efficient procedures are the fundamental 
or information-theoretic limitations of statistical inference, applicable to any algorithm regardless 
of its computational cost. There is a rich line of statistical work on such fundamental limits, an 
understanding of which can have two types of consequences. First, they can reveal gaps between the
performance of an optimal algorithm compared to known computationally efficient methods. Second,
they can demonstrate regimes in which practical algorithms achieve the fundamental limits, which 
means that there is little point in searching for a more effective algorithm. As we shall see, the results 
in this paper lead to understanding of both types. 



1.1 Problem set-up 

The focus of this paper is a canonical instance of a high-dimensional inference problem, namely that 
of linear regression in $d$ dimensions with sparsity constraints on the regression vector $\beta^* \in \mathbb{R}^d$. In this
problem, we observe a pair $(Y, X) \in \mathbb{R}^n \times \mathbb{R}^{n \times d}$, where $X$ is the design matrix and $Y$ is a vector of
response variables. These quantities are linked by the standard linear model

$$Y = X\beta^* + w, \tag{1}$$

where $w \sim N(0, \sigma^2 I_{n \times n})$ is observation noise. The goal is to estimate the unknown vector $\beta^* \in \mathbb{R}^d$
of regression coefficients. The sparse instance of this problem, in which $\beta^*$ satisfies some type of
sparsity constraint, has been investigated extensively over the past decade. Let $X_i$ denote the $i$th row
of $X$ and $X_j$ denote the $j$th column of $X$. A variety of practical algorithms have been proposed and
studied, many based on $\ell_1$-regularization, including basis pursuit [9], the Lasso [31], and the Dantzig
selector [6]. Various authors have obtained convergence rates for different error metrics, including
$\ell_2$-error [6, 4, 37], prediction loss [4, 16], as well as model selection consistency [37, 25, 33, 38].
In addition, a range of sparsity assumptions have been analyzed, including the case of hard sparsity,
meaning that $\beta^*$ has exactly $s \ll d$ non-zero entries, or soft sparsity assumptions, based on imposing
a certain decay rate on the ordered entries of $\beta^*$.



Sparsity constraints. These notions of sparsity can be defined more precisely in terms of the
$\ell_q$-balls¹ for $q \in [0, 1]$, defined as

$$\mathbb{B}_q(R_q) := \Big\{ \beta \in \mathbb{R}^d \;\Big|\; \|\beta\|_q^q = \sum_{j=1}^d |\beta_j|^q \le R_q \Big\}, \tag{2}$$

where in the limiting case $q = 0$, we have the $\ell_0$-ball

$$\mathbb{B}_0(s) := \Big\{ \beta \in \mathbb{R}^d \;\Big|\; \sum_{j=1}^d \mathbb{I}[\beta_j \neq 0] \le s \Big\}, \tag{3}$$

corresponding to the set of vectors $\beta$ with at most $s$ non-zero elements.
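
As a concrete illustration of definitions (2) and (3), the following minimal Python sketch computes the quantity $\sum_j |\beta_j|^q$ (or the number of non-zeros when $q = 0$) and tests membership in the corresponding ball. The function names `lq_mass` and `in_lq_ball` are hypothetical helpers introduced here for illustration only. The last lines also show the failure of convexity for $q < 1$: two points of the ball whose midpoint lies outside it.

```python
from typing import Sequence


def lq_mass(beta: Sequence[float], q: float) -> float:
    """Return sum_j |beta_j|^q for q in (0, 1], or the number of non-zeros for q = 0."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if q == 0.0:
        return float(sum(1 for b in beta if b != 0.0))
    return sum(abs(b) ** q for b in beta)


def in_lq_ball(beta: Sequence[float], q: float, radius: float) -> bool:
    """Membership test for the l_q-ball of radius R_q (or sparsity level s when q = 0)."""
    return lq_mass(beta, q) <= radius


# A vector with two non-zero entries lies in the l_0-ball with s = 2.
beta = [3.0, 0.0, -0.5, 0.0]
print(in_lq_ball(beta, 0.0, 2))

# Non-convexity for q = 1/2: e1 and e2 lie in the ball of radius 1,
# but their midpoint has mass 2 * sqrt(1/2) > 1 and therefore does not.
e1, e2 = [1.0, 0.0], [0.0, 1.0]
midpoint = [0.5, 0.5]
print(in_lq_ball(e1, 0.5, 1.0), in_lq_ball(e2, 0.5, 1.0), in_lq_ball(midpoint, 0.5, 1.0))
```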

Loss functions. We consider estimators $\hat\beta : \mathbb{R}^n \times \mathbb{R}^{n \times d} \to \mathbb{R}^d$ that are measurable functions of the
data $(Y, X)$. Given any such estimator of the true parameter $\beta^*$, there are many criteria for determining
the quality of the estimate. In a decision-theoretic framework, one introduces a loss function such that
$\mathcal{L}(\hat\beta, \beta^*)$ represents the loss incurred by estimating $\hat\beta$ when $\beta^* \in \mathbb{B}_q(R_q)$ is the true parameter. The
associated risk $\mathcal{R}$ is the expected value of the loss over distributions of $(Y, X)$, namely the quantity
$\mathcal{R}(\hat\beta, \beta^*) = \mathbb{E}[\mathcal{L}(\hat\beta, \beta^*)]$. Finally, in the minimax formalism, one seeks to choose an estimator that
minimizes the worst-case risk given by

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_q(R_q)} \mathcal{R}(\hat\beta, \beta^*). \tag{4}$$



¹Strictly speaking, these sets are not "balls" when $q < 1$, since they fail to be convex.

Various choices of the loss function are possible, including (a) the model selection loss, which is
zero if $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta^*)$ and one otherwise; (b) the $\ell_p$-losses

$$\mathcal{L}_p(\hat\beta, \beta^*) := \|\hat\beta - \beta^*\|_p^p = \sum_{j=1}^d |\hat\beta_j - \beta_j^*|^p; \tag{5}$$

and (c) the $\ell_2$-prediction loss $\|X(\hat\beta - \beta^*)\|_2^2 / n$. In this paper, we study the $\ell_p$-losses and the
$\ell_2$-prediction loss.
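
The three loss functions just listed can be written out directly. The sketch below is illustrative only: the function names are hypothetical, and the design matrix is represented as a plain list of rows rather than with any particular numerical library.

```python
from typing import List, Sequence


def model_selection_loss(beta_hat: Sequence[float], beta_star: Sequence[float]) -> int:
    """Loss (a): 0 if the supports coincide, 1 otherwise."""
    supp_hat = {j for j, b in enumerate(beta_hat) if b != 0.0}
    supp_star = {j for j, b in enumerate(beta_star) if b != 0.0}
    return 0 if supp_hat == supp_star else 1


def lp_loss(beta_hat: Sequence[float], beta_star: Sequence[float], p: float) -> float:
    """Loss (b): ||beta_hat - beta_star||_p^p, as in equation (5)."""
    return sum(abs(bh - bs) ** p for bh, bs in zip(beta_hat, beta_star))


def prediction_loss(X: List[Sequence[float]], beta_hat: Sequence[float],
                    beta_star: Sequence[float]) -> float:
    """Loss (c): ||X (beta_hat - beta_star)||_2^2 / n, where n is the number of rows of X."""
    diff = [bh - bs for bh, bs in zip(beta_hat, beta_star)]
    fitted_diff = [sum(x * dj for x, dj in zip(row, diff)) for row in X]
    return sum(v * v for v in fitted_diff) / len(X)


X = [[1.0, 0.0], [0.0, 2.0]]
print(lp_loss([1.0, 1.0], [0.0, 0.0], 2), prediction_loss(X, [1.0, 1.0], [0.0, 0.0]))
```

Note that the prediction loss depends on the design matrix while the $\ell_p$-losses do not; this is why the two can behave very differently for poorly conditioned designs.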

1.2 Our main contributions and related work 

In this paper, we study minimax risks for the high-dimensional linear model (1), in which the
regression vector $\beta^*$ belongs to the ball $\mathbb{B}_q(R_q)$ for $0 \le q \le 1$. The core of the paper consists of four
main theorems, corresponding to lower bounds on minimax rates for the cases of $\ell_p$-losses and the
$\ell_2$-prediction loss, and upper bounds for the $\ell_2$-norm loss and the $\ell_2$-prediction loss. More specifically,
in Theorem 1, we provide lower bounds for $\ell_p$-losses that involve a maximum of two quantities: a
term involving the diameter of the null-space restricted to the $\ell_q$-ball, measuring the degree of
non-identifiability of the model, and a term arising from the $\ell_p$-metric entropy structure for $\ell_q$-balls,
measuring the massiveness of the parameter space. Theorem 2 is complementary in nature, devoted to
upper bounds for $\ell_2$-loss. For $\ell_2$-loss, the upper and lower bounds match up to factors independent of
the triple $(n, d, R_q)$, and depend only on structural properties of the design matrix $X$ (see Theorems 1
and 2). Finally, Theorems 3 and 4 provide lower and upper bounds, respectively, for the $\ell_2$-prediction
loss; these bounds again match up to factors independent of $(n, d, R_q)$. Structural properties of the
design matrix $X$ again play a role in minimax $\ell_2$-prediction risks, but enter in a rather different way
than in the case of $\ell_2$-loss.

For the special case of the Gaussian sequence model (where $X = \sqrt{n}\, I_{n \times n}$), our work is closely
related to the seminal work of Donoho and Johnstone [14], who determined minimax rates for
$\ell_p$-losses over $\ell_q$-balls. Our work applies to the case of general $X$, in which the sample size $n$ need not
be equal to the dimension $d$; however, we re-capture the same scaling as Donoho and Johnstone [14]
when specialized to the case $X = \sqrt{n}\, I_{n \times n}$. In addition to our analysis of $\ell_p$-loss, we also determine
minimax rates for $\ell_2$-prediction loss which, as mentioned above, can behave very differently from the
$\ell_2$-loss for general design matrices $X$. During the process of writing up our results, we became aware
of concurrent work by Zhang (see the brief report [36]) that also studies the problem of determining
minimax upper and lower bounds for $\ell_p$-losses with $\ell_q$-sparsity. We will be able to make a more
thorough comparison once a more detailed version of their work is publicly available.

Naturally, our work also has some connections to the vast body of work on $\ell_1$-based methods for
sparse estimation, particularly for the case of hard sparsity ($q = 0$). Based on our results, the rates
that are achieved by $\ell_1$-methods, such as the Lasso and the Dantzig selector, are minimax optimal for
$\ell_2$-loss, but require somewhat stronger conditions on the design matrix than an "optimal" algorithm,
which is based on searching the $\ell_0$-ball. We compare the conditions that we impose in our minimax
analysis to various conditions imposed in the analysis of $\ell_1$-based methods, including the restricted
isometry property of Candes and Tao [6], the restricted eigenvalue condition imposed in Meinshausen
and Yu [26], the partial Riesz condition in Zhang and Huang [37] and the restricted eigenvalue
condition of Bickel et al. [4]. We find that "optimal" methods, which are based on minimizing least-squares
directly over the $\ell_0$-ball, can succeed for design matrices where $\ell_1$-based methods are not known to
work.

The remainder of this paper is organized as follows. In Section 2, we begin by specifying the
assumptions on the design matrix that enter our analysis, and then state our main results. Section 3
is devoted to discussion of the consequences of our main results, including connections to the normal
sequence model, Gaussian random designs, and related results on $\ell_1$-based methods. In Section 4, we
provide the proofs of our main results, with more technical aspects deferred to the appendices.



2 Main results 

This section is devoted to the statement of our main results, and discussion of some of their conse- 
quences. We begin by specifying the conditions on the high-dimensional scaling and the design matrix 
X that enter different parts of our analysis, before giving precise statements of our main results. 

In this paper, our primary interest is the high-dimensional regime in which $d \gg n$. For technical
reasons, for $q \in (0, 1]$, we require the following condition on the scaling of $(n, d, R_q)$:

$$\frac{d}{R_q\, n^{q/2}} = \Omega(d^{\kappa}) \quad \text{for some } \kappa > 0. \tag{6}$$

In the regime $d \ge n$, this assumption will be satisfied for all $q \in (0, 1]$ as long as $R_q = o(d^{1/2 - \kappa'})$ for
some $\kappa' \in (0, 1/2)$, which is a reasonable condition on the radius of the $\ell_q$-ball for sparse models.
In the work of Donoho and Johnstone [14] on the normal sequence model (special case of $X = I$),
discussed at more length in the sequel, the effect of the scaling of the quantity $\frac{d}{R_q n^{q/2}}$ on the rate of
convergence also requires careful treatment.



2.1 Assumptions on design matrices 

Our first assumption, imposed throughout all of our analysis, is that the columns $\{X_j, j = 1, \ldots, d\}$
of the design matrix $X$ are bounded in $\ell_2$-norm:

Assumption 1 (Column normalization). There exists a constant $0 < \kappa_c < +\infty$ such that

$$\frac{1}{\sqrt{n}} \max_{j = 1, \ldots, d} \|X_j\|_2 \le \kappa_c. \tag{7}$$

In addition, some of our results involve the set defined by intersecting the kernel of $X$ with the
$\ell_q$-ball, which we denote $\mathcal{N}_q(X) := \mathrm{Ker}(X) \cap \mathbb{B}_q(R_q)$. We define the $\mathbb{B}_q(R_q)$-kernel diameter in the
$\ell_p$-norm

$$\mathrm{diam}_p(\mathcal{N}_q(X)) := \max_{\theta \in \mathcal{N}_q(X)} \|\theta\|_p = \max_{\|\theta\|_q^q \le R_q,\; X\theta = 0} \|\theta\|_p. \tag{8}$$

The significance of this diameter should be apparent: for any "perturbation" $\Delta \in \mathcal{N}_q(X)$, it follows
immediately from the linear observation model (1) that no method could ever distinguish between
$\beta^* = 0$ and $\beta^* = \Delta$. Consequently, this $\mathbb{B}_q(R_q)$-kernel diameter is a measure of the lack of
identifiability of the linear model (1) over $\mathbb{B}_q(R_q)$.

Our second assumption, which is required only for achievable results for $\ell_2$-error and lower bounds
for $\ell_2$-prediction error, imposes a lower bound on $\|X\theta\|_2/\sqrt{n}$ in terms of $\|\theta\|_2$ and a residual term:

Assumption 2 (Lower bound on restricted curvature). There exists a constant $\kappa_\ell > 0$ and a function
$f_\ell(R_q, n, d)$ such that

$$\frac{1}{\sqrt{n}} \|X\theta\|_2 \ge \kappa_\ell \|\theta\|_2 - f_\ell(R_q, n, d) \quad \text{for all } \theta \in \mathbb{B}_q(2R_q). \tag{9}$$

Remarks: Conditions on the scaling for $f_\ell(R_q, n, d)$ are provided in Theorems 2 and 3. It is useful
to recognize that the lower bound (9) is closely related to the diameter condition (8); in particular,
Assumption 2 induces an upper bound on the $\mathbb{B}_q(R_q)$-kernel diameter in $\ell_2$-norm, and hence the
identifiability of the model:

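Lemma 1. If Assumption 2 holds for any $q \in (0, 1]$, then the $\mathbb{B}_q(R_q)$-kernel diameter in $\ell_2$-norm is
upper bounded as

$$\mathrm{diam}_2(\mathcal{N}_q(X)) \le \frac{f_\ell(R_q, n, d)}{\kappa_\ell}.$$

Proof. We prove the contrapositive statement. Note that if $\mathrm{diam}_2(\mathcal{N}_q(X)) > \frac{f_\ell(R_q, n, d)}{\kappa_\ell}$, then there
must exist some $\theta \in \mathbb{B}_q(R_q)$ with $X\theta = 0$ and $\|\theta\|_2 > \frac{f_\ell(R_q, n, d)}{\kappa_\ell}$. We then conclude that

$$0 = \frac{1}{\sqrt{n}} \|X\theta\|_2 < \kappa_\ell \|\theta\|_2 - f_\ell(R_q, n, d),$$

which implies there cannot exist any $\kappa_\ell$ for which the lower bound (9) holds. □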

In Section 3.3, we discuss further connections between our assumptions, and the conditions
imposed in analysis of the Lasso and other $\ell_1$-based methods [6, 25, 4]. In the case $q = 0$, we find that
Assumption 2 is weaker than any condition under which an $\ell_1$-based method is known to succeed.
Finally, in Section 3.2, we prove that versions of both Assumptions 1 and 2 hold with high probability
for various classes of non-i.i.d. Gaussian random design matrices (see Proposition 1).



2.2 Risks in $\ell_p$-norm

Having described our assumptions on the design matrix, we now turn to the main results that provide
upper and lower bounds on minimax risks. In all of the statements to follow, we use the quantities
$c_{q,p}$, $c'_{2,q}$, $c_{2,q}$, etc. to denote numerical constants, independent of $n$, $d$, $R_q$, $\sigma^2$ and the design matrix
$X$. We begin with lower bounds on the $\ell_p$-risk.

Theorem 1 (Lower bounds on $\ell_p$-risk). Consider the linear model (1) for a fixed design matrix
$X \in \mathbb{R}^{n \times d}$.

(a) Conditions for $q \in (0, 1]$: Suppose that $X$ is column-normalized (Assumption 1 with $\kappa_c < \infty$).
For any $p \in [1, \infty)$, the minimax $\ell_p$-risk over the $\ell_q$-ball is lower bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_q(R_q)} \mathbb{E}\|\hat\beta - \beta^*\|_p^p \;\ge\; c_{q,p} \max\Big\{ \mathrm{diam}_p^p(\mathcal{N}_q(X)),\; R_q \Big[\frac{\sigma^2}{\kappa_c^2} \frac{\log d}{n}\Big]^{\frac{p - q}{2}} \Big\}. \tag{10}$$

(b) Conditions for $q = 0$: Suppose that $\frac{\|X\theta\|_2}{\sqrt{n}\,\|\theta\|_2} \le \kappa_u$ for all $\theta \in \mathbb{B}_0(2s)$. Then for any $p \in [1, \infty)$,
the minimax $\ell_p$-risk over the $\ell_0$-ball with radius $s = R_0$ is lower bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_0(s)} \mathbb{E}\|\hat\beta - \beta^*\|_p^p \;\ge\; c_{0,p} \max\Big\{ \mathrm{diam}_p^p(\mathcal{N}_0(X)),\; s \Big[\frac{\sigma^2}{\kappa_u^2} \frac{\log(d/s)}{n}\Big]^{\frac{p}{2}} \Big\}. \tag{11}$$

Note that both lower bounds consist of two terms. The first term is simply the diameter of the set
$\mathcal{N}_q(X) = \mathrm{Ker}(X) \cap \mathbb{B}_q(R_q)$, which reflects the extent to which the linear model (1) is unidentifiable.
Clearly, one cannot estimate $\beta^*$ any more accurately than the diameter of this set. In both lower
bounds, the ratios $\sigma^2/\kappa_c^2$ (or $\sigma^2/\kappa_u^2$) correspond to the inverse of the signal-to-noise ratio, comparing
the noise variance $\sigma^2$ to the magnitude of the design matrix measured by $\kappa_c$ (or $\kappa_u$). As the proof will clarify,
the term $[\log d]^{\frac{p - q}{2}}$ in the lower bound (10), and similarly the term $\log(d/s)$ in the bound (11), are
reflections of the complexity of the $\ell_q$-ball, as measured by its metric entropy. For many classes of
design matrices, the second term is of larger order than the diameter term, and hence determines the
rate. (In particular, see Section 3.2 for an in-depth discussion of the case of random Gaussian designs.)

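We now state upper bounds on the $\ell_2$-norm minimax risk over $\ell_q$-balls. For these results, we require
both the column normalization condition (Assumption 1) and the curvature condition (Assumption 2).

Theorem 2 (Upper bounds on $\ell_2$-risk). Consider the model (1) with a fixed design matrix $X \in \mathbb{R}^{n \times d}$
that is column-normalized (Assumption 1 with $\kappa_c < \infty$).

(a) Conditions for $q \in (0, 1]$: If $X$ satisfies Assumption 2 with $f_\ell(R_q, n, d) = o\big(R_q^{1/2} (\frac{\log d}{n})^{1/2 - q/4}\big)$
and $\kappa_\ell > 0$, then there exist numerical constants $c$, $c_1$ and $c_2$ such that the minimax $\ell_2$-risk is upper bounded
as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_q(R_q)} \|\hat\beta - \beta^*\|_2^2 \;\le\; c\, R_q \Big[\frac{\kappa_c^2}{\kappa_\ell^2} \frac{\sigma^2}{\kappa_\ell^2} \frac{\log d}{n}\Big]^{1 - q/2}, \tag{12}$$

with probability greater than $1 - c_1 \exp(-c_2 n)$.

(b) Conditions for $q = 0$: If $X$ satisfies Assumption 2 with $f_\ell(s, n, d) = 0$ and $\kappa_\ell > 0$, then there
exist numerical constants $c$, $c_1$ and $c_2$ such that the minimax $\ell_2$-risk is upper bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_0(s)} \|\hat\beta - \beta^*\|_2^2 \;\le\; c\, \frac{\kappa_c^2}{\kappa_\ell^2} \frac{\sigma^2}{\kappa_\ell^2} \frac{s \log d}{n}, \tag{13}$$

with probability greater than $1 - c_1 \exp(-c_2 n)$. If, in addition, the design matrix satisfies
$\frac{\|X\theta\|_2}{\sqrt{n}\,\|\theta\|_2} \le \kappa_u$ for all $\theta \in \mathbb{B}_0(2s)$, then the minimax $\ell_2$-risk is upper bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_0(s)} \|\hat\beta - \beta^*\|_2^2 \;\le\; 144\, \frac{\kappa_u^2}{\kappa_\ell^2} \frac{\sigma^2}{\kappa_\ell^2} \frac{s \log(d/s)}{n}, \tag{14}$$

with probability greater than $1 - c_1 \exp(-c_2 s \log(d - s))$.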

In the case of $\ell_2$-risk and design matrices $X$ that satisfy the assumptions of both Theorems 1 and 2,
these results identify the minimax risk up to constant factors. In particular, for $q \in (0, 1]$, the
minimax $\ell_2$-risk scales as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_q(R_q)} \mathbb{E}\|\hat\beta - \beta^*\|_2^2 = \Theta\Big( R_q \Big[\frac{\sigma^2 \log d}{n}\Big]^{1 - q/2} \Big), \tag{15}$$

whereas for $q = 0$, the minimax $\ell_2$-risk scales as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_0(s)} \mathbb{E}\|\hat\beta - \beta^*\|_2^2 = \Theta\Big( \frac{\sigma^2 s \log(d/s)}{n} \Big). \tag{16}$$

Note that the bounds with high probability can be converted to bounds in expectation by a standard
integration over the tail probability.

2.3 Risks in prediction norm

In this section, we investigate minimax risks in terms of the $\ell_2$-prediction loss $\|X(\hat\beta - \beta^*)\|_2^2/n$, and
provide both lower and upper bounds on it.

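Theorem 3 (Lower bounds on prediction risk). Consider the model (1) with a fixed design matrix
$X \in \mathbb{R}^{n \times d}$ that is column-normalized (Assumption 1 with $\kappa_c < \infty$).

(a) Conditions for $q \in (0, 1]$: If the design matrix $X$ satisfies Assumption 2 with $\kappa_\ell > 0$ and
$f_\ell(R_q, n, d) = o\big(R_q^{1/2} (\frac{\log d}{n})^{1/2 - q/4}\big)$, then the minimax prediction risk is lower bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_q(R_q)} \mathbb{E}\, \frac{\|X(\hat\beta - \beta^*)\|_2^2}{n} \;\ge\; c'_{2,q}\, R_q\, \kappa_\ell^2 \Big[\frac{\sigma^2}{\kappa_c^2} \frac{\log d}{n}\Big]^{1 - q/2}. \tag{17}$$

(b) Conditions for $q = 0$: Suppose that $X$ satisfies Assumption 2 with $\kappa_\ell > 0$ and $f_\ell(s, n, d) = 0$,
and moreover that $\frac{\|X\theta\|_2}{\sqrt{n}\,\|\theta\|_2} \le \kappa_u$ for all $\theta \in \mathbb{B}_0(2s)$. Then the minimax prediction risk is lower
bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_0(s)} \mathbb{E}\, \frac{\|X(\hat\beta - \beta^*)\|_2^2}{n} \;\ge\; c'_{2,0}\, \kappa_\ell^2\, \frac{\sigma^2}{\kappa_u^2} \frac{s \log(d/s)}{n}. \tag{18}$$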

In the other direction, we have the following result: 

Theorem 4 (Upper bounds on prediction risk). Consider the model (1) with a fixed design matrix
$X \in \mathbb{R}^{n \times d}$.

(a) Conditions for $q \in (0, 1]$: If $X$ satisfies the column normalization condition, then for some
constant $c_{2,q}$, there exist $c_1$ and $c_2$ such that the minimax prediction risk is upper bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_q(R_q)} \frac{1}{n} \|X(\hat\beta - \beta^*)\|_2^2 \;\le\; c_{2,q}\, \kappa_c^2\, R_q \Big[\frac{\sigma^2}{\kappa_c^2} \frac{\log d}{n}\Big]^{1 - q/2}, \tag{19}$$

with probability greater than $1 - c_1 \exp\big(-c_2 R_q (\log d)^{1 - q/2} n^{q/2}\big)$.

(b) Conditions for $q = 0$: For any $X$, with probability greater than $1 - \exp(-10 s \log(d/s))$ the
minimax prediction risk is upper bounded as

$$\min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_0(s)} \frac{1}{n} \|X(\hat\beta - \beta^*)\|_2^2 \;\le\; 81\, \frac{\sigma^2 s \log(d/s)}{n}. \tag{20}$$

2.4 Some intuition 

In order to provide the reader with some intuition, let us make some comments about the scalings that 
appear in our results. 

First, as a basic check of our results, it can be verified that Lemma 1 ensures that the lower bounds
on minimax rates stated in Theorem 1 for $p = 2$ are always less than or equal to the achievable
rates stated in Theorem 2. In particular, since $f_\ell(R_q, n, d) = o\big(R_q^{1/2} (\frac{\log d}{n})^{1/2 - q/4}\big)$ for $q \in (0, 1]$,
Lemma 1 implies that $\mathrm{diam}_2^2(\mathcal{N}_q(X)) = o\big(R_q (\frac{\log d}{n})^{1 - q/2}\big)$, meaning that the achievable rates are
always at least as large as the lower bounds in the case $q \in (0, 1]$. In the case of hard sparsity ($q = 0$),
the upper and lower bounds are clearly consistent since $f_\ell(s, n, d) = 0$ implies the diameter of $\mathcal{N}_0(X)$
is 0.

Second, for the case $q = 0$, there is a concrete interpretation of the rate $\frac{s \log(d/s)}{n}$, which appears in
Theorems 1(b), 2(b), 3(b) and 4(b). Note that there are $\binom{d}{s}$ subsets of size $s$ within $\{1, 2, \ldots, d\}$, and
by standard bounds on binomial coefficients [11], we have $\log \binom{d}{s} = \Theta(s \log(d/s))$. Consequently, the
rate $\frac{s \log(d/s)}{n}$ corresponds to the log number of models divided by the sample size $n$. Note that unless
$s/d = \Theta(1)$, this rate is equivalent (up to constant factors) to $\frac{s \log d}{n}$.
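
The claim that the log number of size-$s$ models is of order $s \log(d/s)$ can be checked numerically using the standard sandwich $(d/s)^s \le \binom{d}{s} \le (e d/s)^s$, valid for $1 \le s \le d$. The sketch below is a sanity check only; the function names and the particular $(d, s)$ pairs are arbitrary choices made for illustration.

```python
import math
from typing import Tuple


def log_num_models(d: int, s: int) -> float:
    """Natural log of the number of size-s subsets of {1, ..., d}."""
    return math.log(math.comb(d, s))


def binomial_sandwich(d: int, s: int) -> Tuple[float, float, float]:
    """Return (s log(d/s), log C(d, s), s log(e d / s))."""
    lower = s * math.log(d / s)
    upper = s * math.log(math.e * d / s)
    return lower, log_num_models(d, s), upper


for d, s in [(100, 5), (1000, 10), (10000, 50)]:
    lower, exact, upper = binomial_sandwich(d, s)
    # The exact value always sits between the two bounds, which differ by exactly s.
    print(f"d={d:6d} s={s:3d}  lower={lower:8.2f}  log C(d,s)={exact:8.2f}  upper={upper:8.2f}")
```

Since the upper and lower bounds differ by the additive term $s$, the two are within a constant factor of one another whenever $\log(d/s)$ is bounded away from zero.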

Third, for $q \in (0, 1]$, the interpretation of the rate $R_q (\frac{\log d}{n})^{1 - q/2}$, appearing in parts (a) of
Theorems 1 through 4, is less immediately obvious but can be understood as follows. Suppose that we
choose a subset of size $s_q$ of coefficients to estimate, and ignore the remaining $d - s_q$ coefficients.
For instance, if we were to choose the top $s_q$ coefficients of $\beta^*$ in absolute value, then the fast decay
imposed by the $\ell_q$-ball condition on $\beta^*$ would mean that the remaining $d - s_q$ coefficients would have
relatively little impact. With this intuition, the rate for $q > 0$ can be interpreted as the rate that would
be achieved by choosing $s_q = R_q (\frac{\log d}{n})^{-q/2}$, and then acting as if the problem were an instance of
a hard-sparse problem ($q = 0$) with $s = s_q$. For such a problem, we would expect to achieve the
rate $\frac{s_q \log d}{n}$, which is exactly equal to $R_q (\frac{\log d}{n})^{1 - q/2}$. Of course, we have only made a very heuristic
argument here; this truncation idea is made more precise in Lemma 2 to appear in the sequel.
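
The arithmetic behind this heuristic is short enough to verify directly: with $s_q = R_q (\log d / n)^{-q/2}$, the hard-sparsity rate $s_q \log d / n$ coincides with $R_q (\log d / n)^{1 - q/2}$. The following sketch checks this identity numerically; the function names and the parameter values are illustrative assumptions, and constants as well as the noise level are ignored.

```python
import math


def effective_sparsity(R_q: float, q: float, n: int, d: int) -> float:
    """Heuristic number of retained coefficients, s_q = R_q (log d / n)^(-q/2)."""
    return R_q * (math.log(d) / n) ** (-q / 2.0)


def soft_sparsity_rate(R_q: float, q: float, n: int, d: int) -> float:
    """Rate R_q (log d / n)^(1 - q/2) appearing in parts (a) of the theorems."""
    return R_q * (math.log(d) / n) ** (1.0 - q / 2.0)


R_q, q, n, d = 10.0, 0.5, 1000, 10000
s_q = effective_sparsity(R_q, q, n, d)
hard_rate = s_q * math.log(d) / n  # rate of a hard-sparse problem with s = s_q
print(math.isclose(hard_rate, soft_sparsity_rate(R_q, q, n, d)))
```

Note that $s_q$ grows with $n$: as more data become available, it pays to estimate more of the small coefficients rather than truncating them.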

Fourth, we note that the minimax rates for $\ell_2$-prediction error and $\ell_2$-norm error are essentially the
same except that the design matrix structure enters minimax risks in very different ways. In particular,
note that proving lower bounds on prediction risk requires imposing relatively strong conditions on the
design $X$, namely Assumptions 1 and 2 as stated in Theorem 3. In contrast, obtaining upper bounds
on prediction risk requires very mild conditions. At the most extreme, the upper bound for $q = 0$ in
Theorem 4 requires no assumptions on $X$, while for $q > 0$ only the column normalization condition
is required. All of these statements are reversed for $\ell_2$-risks, where lower bounds can be proved with
only Assumption 1 on $X$ (see Theorem 1), whereas upper bounds require both Assumptions 1 and 2.

Lastly, in order to appreciate the difference between the conditions for $\ell_2$-prediction error and $\ell_2$
error, it is useful to consider a toy but illuminating example. Consider the linear regression problem
defined by a design matrix $X = [X_1 \; X_2 \; \cdots \; X_d]$ with identical columns, that is, $X_j = X_1$ for
all $j = 1, \ldots, d$. We assume that the vector $X_1 \in \mathbb{R}^n$ is suitably scaled so that the column-normalization
condition (Assumption 1) is satisfied. For this particular choice of design matrix, the linear observation
model (1) reduces to $Y = \big(\sum_{j=1}^d \beta_j^*\big) X_1 + w$. For the case of hard sparsity ($q = 0$), an elementary
argument shows that the minimax risk in $\ell_2$-prediction error scales as $\Theta(\frac{1}{n})$. This scaling implies that
the upper bound (20) from Theorem 4 holds (but is not tight).² Consequently, this highly degenerate
design matrix yields a very easy problem for $\ell_2$-prediction, since the $1/n$ rate is essentially parametric.
In sharp contrast, for the case of $\ell_2$-norm error (still with hard sparsity $q = 0$), the model becomes
unidentifiable. To see the lack of identifiability, let $e_i \in \mathbb{R}^d$ denote the unit vector with 1 in position
$i$, and consider the two regression vectors $\beta^* = c\, e_1$ and $\tilde\beta = c\, e_2$, for some constant $c \in \mathbb{R}$. Both
choices yield the same observation vector $Y$, and since the choice of $c$ is arbitrary, the minimax
$\ell_2$-error is infinite. In this case, the lower bound (11) on $\ell_2$-error from Theorem 1 holds (and is tight,
since the kernel diameter is infinite). In contrast, the upper bound (13) on $\ell_2$-error from Theorem 2
does not apply, because Assumption 2 is violated due to the extreme degeneracy of the design matrix.
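
The identical-columns example can be reproduced in a few lines. In the sketch below, the specific column, the dimensions, and the constant `c` are arbitrary illustrative choices, and the helper names are hypothetical. It confirms that two 1-sparse vectors supported on different coordinates produce exactly the same noiseless response, although their squared $\ell_2$-distance is $2c^2$.

```python
from typing import List


def identical_column_design(x1: List[float], d: int) -> List[List[float]]:
    """Build the n x d design (as a list of rows) whose d columns all equal x1."""
    return [[xi] * d for xi in x1]


def matvec(X: List[List[float]], beta: List[float]) -> List[float]:
    """Compute the noiseless response X beta."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


n, d, c = 4, 3, 7.0
x1 = [1.0, -1.0, 1.0, -1.0]          # ||x1||_2 / sqrt(n) = 1, so Assumption 1 holds with kappa_c = 1
X = identical_column_design(x1, d)

beta_a = [c, 0.0, 0.0]               # c * e_1
beta_b = [0.0, c, 0.0]               # c * e_2

same_response = matvec(X, beta_a) == matvec(X, beta_b)
squared_l2_distance = sum((a - b) ** 2 for a, b in zip(beta_a, beta_b))
print(same_response, squared_l2_distance)
```

Because the difference `beta_a - beta_b` lies in the kernel of `X` for every value of `c`, the prediction loss between the two vectors is zero while their $\ell_2$-distance can be made arbitrarily large.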

²Note that the lower bound (18) on the $\ell_2$-prediction error from Theorem 3 does not apply to this model, since this
degenerate design matrix with identical columns does not satisfy any version of Assumption 2.

3 Some consequences

In this section, we discuss some consequences of our results. We begin by considering the classical
Gaussian sequence model, which corresponds to a special case of our linear regression model, and
making explicit comparisons to the results of Donoho and Johnstone [14] on minimax risks over
$\ell_q$-balls.



3.1 Connections with the normal sequence model 

The normal (or Gaussian) sequence model is defined by the observation sequence

$$y_i = \theta_i^* + \varepsilon_i, \quad \text{for } i = 1, \ldots, n, \tag{21}$$

where $\theta^* \in \Theta \subseteq \mathbb{R}^n$ is a fixed but unknown vector, and the noise variables $\varepsilon_i \sim N(0, \frac{\tau^2}{n})$ are
i.i.d. normal variates. Many non-parametric estimation problems, including regression and density
estimation, are asymptotically equivalent to an instance of the Gaussian sequence model [28, 27, 5],
where the set $\Theta$ depends on the underlying "smoothness" conditions imposed on the functions. For
instance, for functions that have an $m$th derivative that is square-integrable (a particular kind of
Sobolev space), the set $\Theta$ corresponds to an ellipsoid; on the other hand, for certain choices of Besov
spaces, it corresponds to an $\ell_q$-ball.

In the case $\Theta = \mathbb{B}_q(R_q)$, our linear regression model (1) includes the normal sequence model (21)
as a special case. In particular, it corresponds to setting $d = n$, the design matrix $X = I_{n \times n}$, and noise
variance $\sigma^2 = \frac{\tau^2}{n}$. For this particular model, seminal work by Donoho and Johnstone [14] derived
sharp asymptotic results on the minimax error for general $\ell_p$-norms over $\ell_q$-balls. Here we show that
a corollary of our main theorems yields the same scaling in the case $p = 2$ and $q \in [0, 1]$.

Corollary 1. Consider the normal sequence model (21) with $\Theta = \mathbb{B}_q(R_q)$ for some $q \in (0, 1]$. Then
there are constants $c'_q \le c_q$ depending only on $q$ such that

$$c'_q\, R_q \Big(\frac{2\tau^2 \log n}{n}\Big)^{1 - q/2} \;\le\; \min_{\hat\beta} \max_{\beta^* \in \mathbb{B}_q(R_q)} \mathbb{E}\|\hat\beta - \beta^*\|_2^2 \;\le\; c_q\, R_q \Big(\frac{2\tau^2 \log n}{n}\Big)^{1 - q/2}. \tag{22}$$

These bounds follow from our main theorems, via the substitutions $n = d$, $\sigma^2 = \frac{\tau^2}{n}$, and
$\kappa_u = \kappa_\ell = 1$. To be clear, Donoho and Johnstone [14] provide a far more careful analysis that yields
sharper control of the constants than we have provided here.



3.2 Random Gaussian Design 

Another special case of particular interest is that of random Gaussian design matrices. A widely
studied instance is the standard Gaussian ensemble, in which the entries of $X \in \mathbb{R}^{n \times d}$ are i.i.d.
$N(0, 1)$ variates. A variety of results are known for the singular values of random matrices $X$ drawn
from this ensemble (e.g., [2, 3, 12]); moreover, some past work [13, 6] has studied the behavior
of different $\ell_1$-based methods for this ensemble. In modeling terms, requiring that all entries of the
design matrix $X$ are i.i.d. is an overly
restrictive assumption, and not likely to be met in applications where the design matrix cannot be
chosen. Accordingly, let us consider the more general class of Gaussian random design matrices
$X \in \mathbb{R}^{n \times d}$, in which the rows are independent, but there can be arbitrary correlations between the
columns of $X$. To simplify notation, we define the shorthand $\rho(\Sigma) := \max_{j = 1, \ldots, d} \Sigma_{jj}$, corresponding
to the maximal variance of any element of $X$, and use $\Sigma^{1/2}$ to denote the symmetric square root of the
covariance matrix $\Sigma$.

In this model, each column $X_j$, $j = 1, \ldots, d$, has i.i.d. elements. Consequently, it is an immediate
consequence of standard concentration results for $\chi^2_n$ variates (see Appendix I) that, for some
numerical constant $c > 0$,

$$\max_{j = 1, \ldots, d} \frac{\|X_j\|_2}{\sqrt{n}} \;\le\; \rho(\Sigma) \Big(1 + \sqrt{\frac{c \log d}{n}}\Big). \tag{23}$$

Therefore, Assumption 1 holds as long as $n = \Omega(\log d)$ and $\rho(\Sigma)$ is bounded.
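
The column normalization property is easy to observe in simulation. The sketch below is restricted to the standard Gaussian ensemble (identity covariance, so that the maximal variance equals 1), which is a simplifying assumption made here; the function name, sample sizes, and seed are likewise arbitrary illustrative choices. It draws the design entry by entry and reports the largest normalized column norm.

```python
import math
import random


def max_normalized_column_norm(n: int, d: int, seed: int = 0) -> float:
    """Draw an n x d matrix with i.i.d. N(0, 1) entries and return max_j ||X_j||_2 / sqrt(n)."""
    rng = random.Random(seed)
    squared_norms = [0.0] * d
    for _ in range(n):                      # one row at a time
        for j in range(d):
            g = rng.gauss(0.0, 1.0)
            squared_norms[j] += g * g       # accumulate the squared norm of column j
    return max(math.sqrt(v / n) for v in squared_norms)


# With n well above log d, the maximum stays within a modest factor of 1.
print(max_normalized_column_norm(n=200, d=50))
```

Each squared column norm is a chi-squared variate with $n$ degrees of freedom, so the deviation of the maximum from 1 shrinks as $n$ grows relative to $\log d$.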

Showing that a version of Assumption 2 holds with high probability requires more work. We 
summarize our findings in the following result: 

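Proposition 1. Consider a random design matrix $X \in \mathbb{R}^{n \times d}$ formed by drawing each row $X_i \in \mathbb{R}^d$
i.i.d. from an $N(0, \Sigma)$ distribution. Then for some numerical constants $c_k \in (0, \infty)$, $k = 1, 2$, we
have

$$\frac{\|Xv\|_2}{\sqrt{n}} \;\ge\; \frac{1}{2} \|\Sigma^{1/2} v\|_2 - 6 \Big(\frac{\rho(\Sigma) \log d}{n}\Big)^{1/2} \|v\|_1 \quad \text{for all } v \in \mathbb{R}^d, \tag{24}$$

with probability $1 - c_1 \exp(-c_2 n)$.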

Remarks: Past work by Amini and Wainwright [1] in the analysis of sparse PCA has established
an upper bound analogous to the lower bound (24) for the special case $\Sigma = I_{d \times d}$. We provide a
proof of this matching upper bound for general $\Sigma$ as part of the proof of Proposition 1 in Appendix E.
The argument is based on Slepian's lemma [12] and its extension due to Gordon [15], combined with
concentration of Gaussian measure results [22]. Note that we have made no effort to obtain sharp
leading constants (i.e., the factors 1/2 and 6 can easily be improved), but the basic result (24) suffices
for our purposes.

Let us now discuss the implications of this result for Assumption 2. First, in the case $q = 0$, the
bound (13) in Theorem 2 requires that Assumption 2 holds with $f_\ell(s, n, d) = 0$ for all $\theta \in \mathbb{B}_0(2s)$. To
see the connection with Proposition 1, note that if $v \in \mathbb{B}_0(2s)$, then we have $\|v\|_1 \le \sqrt{2s}\, \|v\|_2$, and
hence

$$\frac{\|Xv\|_2}{\sqrt{n}} \;\ge\; \Big\{ \frac{1}{2} \frac{\|\Sigma^{1/2} v\|_2}{\|v\|_2} - 6\sqrt{2} \Big(\frac{\rho(\Sigma)\, s \log d}{n}\Big)^{1/2} \Big\} \|v\|_2.$$

Therefore, as long as $\rho(\Sigma) < \infty$, $\min_{v \in \mathbb{B}_0(2s)} \frac{\|\Sigma^{1/2} v\|_2}{\|v\|_2} > 0$ and $\frac{s \log d}{n} = o(1)$, the condition needed
for the bound (13) will be met.

Second, in the case $q \in (0, 1]$, Theorem 2(a) requires that Assumption 2 hold with the residual
term $f_\ell(R_q, n, d) = o\big(R_q^{1/2} (\frac{\log d}{n})^{1/2 - q/4}\big)$. We claim that Proposition 1 guarantees this condition, as
long as $\rho(\Sigma) < \infty$ and the minimum eigenvalue of $\Sigma$ is bounded away from zero. In order to verify
this claim, we require the following result:

Lemma 2. For any vector $\theta \in \mathbb{B}_q(2R_q)$ and any positive number $\tau > 0$, we have

$$\|\theta\|_1 \;\le\; \sqrt{2R_q}\, \tau^{-q/2}\, \|\theta\|_2 + 2R_q\, \tau^{1 - q}. \tag{25}$$

Although this type of result is standard (e.g., [14]), we provide a proof in Appendix A for completeness.
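
Inequality (25) can also be checked numerically before reading the proof. The sketch below is a sanity check under stated assumptions, not a substitute for the argument in Appendix A: it draws random vectors, sets $2R_q$ to the smallest admissible value $\|\theta\|_q^q$ (the hardest case for the bound), and evaluates the gap between the right- and left-hand sides for several thresholds $\tau$. All names, the sampling distribution, and the grids of $q$ and $\tau$ are illustrative choices.

```python
import math
import random
from typing import List


def lemma2_gap(theta: List[float], q: float, tau: float) -> float:
    """Return RHS - LHS of (25), taking 2 R_q = ||theta||_q^q (the tightest admissible radius)."""
    two_Rq = sum(abs(t) ** q for t in theta)
    l1 = sum(abs(t) for t in theta)
    l2 = math.sqrt(sum(t * t for t in theta))
    rhs = math.sqrt(two_Rq) * tau ** (-q / 2.0) * l2 + two_Rq * tau ** (1.0 - q)
    return rhs - l1


def smallest_gap(trials: int = 200, dim: int = 30, seed: int = 0) -> float:
    """Smallest gap observed over random vectors, exponents q, and thresholds tau."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        # Heavy-tailed entries: a few large coordinates and many small ones.
        theta = [rng.gauss(0.0, 1.0) ** 3 for _ in range(dim)]
        for q in (0.25, 0.5, 0.75, 1.0):
            for tau in (0.01, 0.1, 1.0, 10.0):
                worst = min(worst, lemma2_gap(theta, q, tau))
    return worst


print(smallest_gap() >= 0.0)
```

The two terms on the right-hand side of (25) correspond to the coordinates larger than $\tau$ in absolute value, of which there are at most $2R_q \tau^{-q}$, and to the remaining small coordinates, whose total $\ell_1$-mass is at most $2R_q \tau^{1-q}$.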



In order to exploit Lemma 2, let us set $\tau = \sqrt{\frac{\log d}{n}}$. With this choice, we can substitute the resulting
bound (25) into the lower bound (24), thereby obtaining that

$$\frac{\|Xv\|_2}{\sqrt{n}} \;\ge\; \Big\{ \frac{1}{2} \frac{\|\Sigma^{1/2} v\|_2}{\|v\|_2} - 6 \sqrt{2 \rho(\Sigma) R_q}\, \Big(\frac{\log d}{n}\Big)^{1/2 - q/4} \Big\} \|v\|_2 \;-\; 12\, \rho(\Sigma)^{1/2} R_q \Big(\frac{\log d}{n}\Big)^{1 - q/2}.$$

Recalling that the condition $\sqrt{R_q}\, (\frac{\log d}{n})^{1/2 - q/4} = o(1)$ is required for consistency, we see that
Assumption 2 holds as long as $\rho(\Sigma) < +\infty$ and the minimum eigenvalue of $\Sigma$ is bounded away from
zero.

Lastly, it is also worth noting that we can obtain the following stronger result for the case
$q = 0$, in the case that $\min_{v \in \mathbb{B}_0(2s)} \frac{\|\Sigma^{1/2} v\|_2}{\|v\|_2} > 0$ and $\max_{v \in \mathbb{B}_0(2s)} \frac{\|\Sigma^{1/2} v\|_2}{\|v\|_2} < \infty$. If the sparse
eigenspectrum is bounded in this way, then as long as $n > c_3\, s \log(d/s)$, we have

$$3 \|\Sigma^{1/2} v\|_2 \;\ge\; \frac{\|Xv\|_2}{\sqrt{n}} \;\ge\; \frac{1}{2} \|\Sigma^{1/2} v\|_2 \quad \text{for all } v \in \mathbb{B}_0(2s) \tag{26}$$

with probability greater than $1 - c_1 \exp(-c_2 n)$. This fact follows by applying the union bound over
all $\binom{d}{2s}$ subsets of size $2s$, combined with standard concentration results for random matrices (e.g.,
see Davidson and Szarek [12] for $\Sigma = I$, and Wainwright [33] for the straightforward extensions to
non-identity covariances).

3.3 Comparison to $\ell_1$-based methods

In addition, it is interesting to compare our minimax rates of convergence for $\ell_2$-error with known
results for $\ell_1$-based methods, including the Lasso [31] and the closely related Dantzig selector [6].
Here we discuss only the case $q = 0$, since we are currently unaware of any $\ell_2$-error bound for $\ell_1$-based
methods for $q \in (0, 1]$. For the Lasso, past work [37, 26] has shown that its $\ell_2$-error is upper bounded
by $\frac{s \log d}{n}$ under sparse eigenvalue conditions. Similarly, Candes and Tao [6] show the same scaling for
the Dantzig selector, when applied to matrices that satisfy the more restrictive RIP conditions. More
recent work by Bickel et al. [4] provides a simultaneous analysis of the Lasso and Dantzig selector
under a common set of assumptions that are weaker than both the RIP condition and sparse eigenvalue
conditions. Together with our results (in particular, Theorem 1(b)), this body of work shows that under
appropriate conditions on the design $X$, the rates achieved by $\ell_1$-methods in the case of hard sparsity
($q = 0$) are minimax-optimal.

Given that the rates are optimal, it is appropriate to compare the conditions needed by an "optimal"
algorithm, such as that analyzed in Theorem 2, to those used in the analysis of $\ell_1$-based methods. One
set of conditions, known as the restricted isometry property [6] or RIP for short, is based on very
strong constraints on the condition numbers of all submatrices of $X$ up to size $2s$, requiring that they
be near-isometries (i.e., with condition numbers extremely close to 1). Such conditions are satisfied by
matrices with columns that are all very close to orthogonal (e.g., when $X$ has i.i.d. $N(0, 1)$ entries and
$n = \Omega(\log \binom{d}{2s})$), but are violated for many reasonable matrix classes (e.g., Toeplitz matrices) that
arise in statistical practice. Zhang and Huang [37] imposed a weaker sparse Riesz condition, based on
imposing constraints (different from those of RIP) on the condition numbers of all submatrices of $X$
up to a size that grows as a function of $s$ and $n$. Meinshausen and Yu [26] impose a bound in terms of
the condition numbers or minimum and maximum restricted eigenvalues for submatrices of $X$ up to
size $s \log n$. It is unclear whether the conditions in Meinshausen and Yu [26] are weaker or stronger
than the conditions in Zhang and Huang [37].

The weakest known sufficient conditions to date are due to Bickel et al. [4], who show that in 
addition to the column normalization condition (Assumption 1 in this paper), it suffices to impose a 
milder condition, namely a lower bound on a certain type of restricted eigenvalue (RE). They show 
that this RE condition is less restrictive than both the RIP condition [6] and the eigenvalue conditions 
imposed in Meinshausen and Yu [26]. For a given vector $\theta \in \mathbb{R}^d$, let $\theta_{(j)}$ refer to the $j$th largest
coefficient in absolute value, so that we have the ordering

$$|\theta_{(1)}| \ge |\theta_{(2)}| \ge \cdots \ge |\theta_{(d-1)}| \ge |\theta_{(d)}|.$$

For a given scalar $c_0$ and integer $s = 1, 2, \ldots, d$, let us define the set

$$\Gamma(s, c_0) := \Big\{ \theta \in \mathbb{R}^d \;\Big|\; \sum_{j=s+1}^{d} |\theta_{(j)}| \le c_0 \sum_{j=1}^{s} |\theta_{(j)}| \Big\}.$$

In words, the set $\Gamma(s, c_0)$ contains all vectors in $\mathbb{R}^d$ where the $\ell_1$-norm of the largest $s$ coordinates
provides an upper bound (up to constant $c_0$) to the $\ell_1$-norm over the smallest $d - s$ coordinates. For
example, if $d = 3$, then the vector $(1, 1/2, 1/4) \in \Gamma(1, 1)$ whereas the vector $(1, 3/4, 3/4) \notin \Gamma(1, 1)$.
With this notation, the restricted eigenvalue (RE) assumption can be stated as follows:

Assumption 3 (Restricted lower eigenvalues [4]). There exists a function $\kappa(X, c_0) > 0$ such that

$$\frac{1}{\sqrt{n}} \|X\theta\|_2 \ge \kappa(X, c_0)\, \|\theta\|_2 \quad \text{for all } \theta \in \Gamma(s, c_0).$$

Bickel et al. [4] require a slightly stronger condition for bounding the $\ell_2$-loss if $s$ depends on $n$.
However, the conditions are equivalent for fixed $s$, and Assumption 3 is much simpler to analyze and
compare to Assumption 2. At this point, we have not seen conditions weaker than Assumption 3.
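The cone-type set appearing in the RE assumption is easy to test for membership. The sketch below is a minimal illustration (the function name `in_gamma` is an assumption made here, not notation from the paper): it sorts the coordinates by magnitude and compares the $\ell_1$-mass of the $d - s$ smallest coordinates against $c_0$ times the $\ell_1$-mass of the $s$ largest, and it is applied to the two $d = 3$ example vectors given above.

```python
import numpy as np

def in_gamma(theta, s, c0):
    """Return True if theta satisfies the cone constraint with parameters (s, c0)."""
    mags = np.sort(np.abs(np.asarray(theta, dtype=float)))[::-1]  # decreasing magnitudes
    head = mags[:s].sum()      # l1-norm of the s largest coordinates
    tail = mags[s:].sum()      # l1-norm of the remaining d - s coordinates
    return bool(tail <= c0 * head)

print(in_gamma([1, 0.5, 0.25], s=1, c0=1))    # first example vector from the text
print(in_gamma([1, 0.75, 0.75], s=1, c0=1))   # second example vector from the text
```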

The following corollary of Proposition 1 shows that Assumption 3 is satisfied with high probability
for broad classes of Gaussian random designs:

Corollary 2. Suppose that $\rho(\Sigma)$ remains bounded, $\min_{v \in \mathbb{B}_0(2s)} \frac{\|\Sigma^{1/2} v\|_2}{\|v\|_2} > 0$, and that $n > c_3\, s \log d$
for a sufficiently large constant $c_3$. Then a randomly drawn design matrix $X \in \mathbb{R}^{n \times d}$ with i.i.d. $N(0, \Sigma)$
rows satisfies Assumption 3 with probability greater than $1 - c_1 \exp(-c_2 n)$.

Proof. Note that for any vector $\theta \in \Gamma(s, c_0)$, we have

$$\|\theta\|_1 \le (1 + c_0) \sum_{j=1}^{s} |\theta_{(j)}| \le (1 + c_0) \sqrt{s}\, \|\theta\|_2.$$

Consequently, if the bound (24) holds, we have for any $v \in \Gamma(s, c_0)$

$$\frac{\|Xv\|_2}{\sqrt{n}} \ge \left\{ \frac{\|\Sigma^{1/2} v\|_2}{2\|v\|_2} - 6(1 + c_0)\left(\frac{\rho(\Sigma)\, s \log d}{n}\right)^{1/2} \right\} \|v\|_2.$$

Since we have assumed that $n > c_3\, s \log d$ for a sufficiently large constant, the claim follows. □

Combined with the discussion following Proposition 1, this result shows that both the conditions
required by Theorem 2 of this paper and the analysis of Bickel et al. [4] (both in the case $q = 0$) hold
with high probability for Gaussian random designs.



3.3.1 Comparison of RE assumption with Assumption 2 

In the case $q = 0$, the condition required by the estimator that performs least-squares over the $\ell_0$-ball
(namely, the form of Assumption 2 used in Theorem 2(b)) is not stronger than Assumption 3. This
fact was previously established by Bickel et al. (see p. 7, [4]). We now provide a simple pedagogical
example to show that the $\ell_1$-based relaxation can fail to recover the true parameter while the optimal
$\ell_0$-based algorithm succeeds. In particular, let us assume that the noise vector $w = 0$, and consider
the design matrix $X \in \mathbb{R}^{2 \times 3}$
corresponding to a regression problem with $n = 2$ and $d = 3$. Say that the regression vector $\beta^* \in \mathbb{R}^3$
is hard sparse with one non-zero entry (i.e., $s = 1$). Observe that the vector $\Delta := [1 \;\; 1/3 \;\; 1/3]$
belongs to the null-space of $X$, and moreover $\Delta \in \Gamma(1, 1)$ but $\Delta \notin \mathbb{B}_0(2)$. Since all the $2 \times 2$ sub-matrices
of $X$ have rank two, we have $\mathbb{B}_0(2) \cap \ker(X) = \{0\}$. By known results from Cohen
et al. [10] (see, in particular, their Lemma 3.1), this condition implies that
the $\ell_0$-based algorithm can exactly recover any 1-sparse vector. On the other hand, suppose that,
for instance, the true regression vector is given by $\beta^* = [1 \;\; 0 \;\; 0]$. If applied to this problem with
no noise, the Lasso would incorrectly recover the solution $\hat{\beta} := [0 \;\; -1/3 \;\; -1/3]$ since $\|\hat{\beta}\|_1 =
2/3 < 1 = \|\beta^*\|_1$. Although this example is low-dimensional ($(s, d) = (1, 3)$), we suspect that higher
dimensional examples of design matrices that satisfy the conditions required for the minimax rate but
not those required for $\ell_1$-based methods may be constructed using similar arguments. This construction
highlights that there are instances of design matrices $X$ for which $\ell_1$-based methods fail to recover the
true parameter $\beta^*$ for $q = 0$ while the optimal $\ell_0$-based algorithm succeeds.
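The failure mode described in this example can be reproduced numerically. The sketch below uses a hypothetical $2 \times 3$ matrix chosen here only because it has the stated properties (null space spanned by $[1, 1/3, 1/3]$, all $2 \times 2$ submatrices of rank two); it should not be read as the matrix used in the paper. The noiseless Lasso is modeled, as an assumption, by minimizing the $\ell_1$-norm over all exact solutions of $X\beta = y$, and the $\ell_0$-based method by a search over single columns.

```python
import numpy as np

# Hypothetical 2 x 3 design (an assumption for illustration): its null space is
# spanned by delta = [1, 1/3, 1/3] and every 2 x 2 submatrix has rank two.
X = np.array([[1.0, -2.0, -1.0],
              [2.0, -3.0, -3.0]])
delta = np.array([1.0, 1 / 3, 1 / 3])
beta_star = np.array([1.0, 0.0, 0.0])
y = X @ beta_star                      # noiseless observations (w = 0)

# l1 minimisation over {beta : X beta = y} = {beta_star + t * delta}.
# The map t -> ||beta_star + t * delta||_1 is convex and piecewise linear, so its
# minimum is attained at a breakpoint (a value of t that zeroes a coordinate).
breakpoints = [-beta_star[j] / delta[j] for j in range(3)]
t_best = min(breakpoints, key=lambda t: np.abs(beta_star + t * delta).sum())
beta_l1 = beta_star + t_best * delta

# l0 search with s = 1: try each single column and keep the one that fits exactly.
beta_l0 = None
for j in range(3):
    coef = (X[:, j] @ y) / (X[:, j] @ X[:, j])
    if np.allclose(coef * X[:, j], y):
        beta_l0 = np.zeros(3)
        beta_l0[j] = coef

print(beta_l1, beta_l0)
```

In this sketch the $\ell_1$ solution has smaller $\ell_1$-norm than the true vector but is not sparse, whereas the exhaustive search over one-column supports returns the true vector.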

In summary, for the hard sparsity case $q = 0$, methods based on $\ell_1$-relaxation can achieve the
minimax rate for $\ell_2$-error, but the current analyses of these $\ell_1$-methods [6, 26, 4] are based
on imposing stronger conditions on the design matrix $X$ than those required by the "optimal" estimator
that performs least-squares over the $\ell_0$-ball.

4 Proofs of main results 

In this section, we provide the proofs of our main theorems, with more technical lemmas and their 
proofs deferred to the appendices. To begin, we provide a high-level overview that outlines the main 
steps of the proofs. 

Basic steps for lower bounds The proofs for the lower bounds follow an information-theoretic
method based on Fano's inequality [11], as used in classical work on nonparametric estimation [19,
34, 35]. A key ingredient is a fine characterization of the metric entropy structure of $\ell_q$-balls [20, 8].
At a high level, the proof of each lower bound follows three basic steps:

(1) Let $\|\cdot\|_*$ be the norm for which we wish to lower bound the minimax risk; for Theorem 1, the
norm $\|\cdot\|_*$ corresponds to the $\ell_p$-norm, whereas for Theorem 3, it is the $\ell_2$-prediction norm
(the square root of the prediction loss). We first construct a $\delta_n$-packing set for $\mathbb{B}_q(R_q)$ in the
norm $\|\cdot\|_*$, where $\delta_n > 0$ is a free parameter to be determined in a later step. The packing
set is constructed by deriving lower bounds on the packing numbers for $\mathbb{B}_q(R_q)$; we discuss the
concepts of packing sets and packing numbers at more length in Section 4.1. For the case of
$\ell_q$-balls for $q > 0$, tight bounds on the packing numbers in $\ell_p$-norm have been developed in the
approximation theory literature [20]. For $q = 0$, we use combinatorial arguments to bound the packing
numbers. We use Assumption 2 in order to relate the packing number in the $\ell_2$-prediction norm
to the packing number in $\ell_2$-norm.

(2) The next step is to use a standard reduction to show that any estimator with minimax risk $O(\delta_n^2)$
must be able to solve a hypothesis-testing problem over the packing set with vanishing error
probability. More concretely, suppose that an adversary places a uniform distribution over the
$\delta_n$-packing set in $\mathbb{B}_q(R_q)$, and let this random variable be $\Theta$. The problem of recovering $\Theta$
is a multi-way hypothesis testing problem, so that we may apply Fano's inequality to lower
bound the probability of error. The Fano bound involves the log packing number and the mutual
information $I(Y; \Theta)$ between the observation vector $Y \in \mathbb{R}^n$ and the random parameter $\Theta$
chosen uniformly from the packing set.

(3) Finally, following a technique introduced by Yang and Barron [34], we derive an upper bound
on the mutual information between $Y$ and $\Theta$ by constructing an $\epsilon_n$-covering set for $\mathbb{B}_q(R_q)$ with
respect to the $\ell_2$-prediction semi-norm. Using Lemma 4 in Section 4.1.2, we establish a link
between covering numbers in $\ell_2$-prediction semi-norm and covering numbers in $\ell_2$-norm. Finally,
we choose the free parameters $\delta_n > 0$ and $\epsilon_n > 0$ so as to optimize the lower bound.

Basic steps for upper bounds The proofs for the upper bounds involve direct analysis of the natural
estimator that performs least-squares over the $\ell_q$-ball. The proof is constructive and involves two steps,
the first of which is standard while the second step is more specific to the problem at hand:

(1) Since the estimator is based on minimizing the least-squares loss over the ball $\mathbb{B}_q(R_q)$, some
straightforward algebra allows us to upper bound the $\ell_2$-prediction error by a term that measures
the supremum of a Gaussian empirical process over the ball $\mathbb{B}_q(2R_q)$. This step is completely
generic and applies to any least-squares estimator involving a linear model.

(2) The second and more challenging step involves computing upper bounds on the supremum of
the Gaussian process over $\mathbb{B}_q(2R_q)$. For each of the upper bounds, our approach is slightly
different in the details. Common steps include upper bounds on the covering numbers of the
ball $\mathbb{B}_q(2R_q)$, as well as on the image of these balls under the mapping $X : \mathbb{R}^d \to \mathbb{R}^n$. For the
case $q = 1$, we make use of Lemma 2 in order to relate the $\ell_1$-norm to the $\ell_2$-norm for vectors
that lie in an $\ell_q$-ball. For $q \in (0, 1)$, we make use of some chaining and peeling results from
empirical process theory (e.g., Van de Geer [32]).

4.1 Packing, covering, and metric entropy 

The notions of packing and covering numbers play a crucial role in our analysis, so we begin with some
background, with emphasis on the case of covering/packing for $\ell_q$-balls.

Definition 1 (Covering and packing numbers). Consider a metric space consisting of a set $S$ and a
metric $\rho : S \times S \to \mathbb{R}_+$.

(a) An $\epsilon$-covering of $S$ in the metric $\rho$ is a collection $\{\beta^1, \ldots, \beta^N\} \subset S$ such that for all $\beta \in S$,
there exists some $i \in \{1, \ldots, N\}$ with $\rho(\beta, \beta^i) \le \epsilon$. The $\epsilon$-covering number $N(\epsilon; S, \rho)$ is the
cardinality of the smallest $\epsilon$-covering.

(b) A $\delta$-packing of $S$ in the metric $\rho$ is a collection $\{\beta^1, \ldots, \beta^M\} \subset S$ such that $\rho(\beta^i, \beta^j) > \delta$ for
all $i \ne j$. The $\delta$-packing number $M(\delta; S, \rho)$ is the cardinality of the largest $\delta$-packing.

In simple terms, the covering number $N(\epsilon; S, \rho)$ is the minimum number of balls with radius $\epsilon$
under the metric $\rho$ required to completely cover the space, so that every point in $S$ lies in some ball.
The packing number $M(\delta; S, \rho)$ is the maximum number of balls of radius $\delta/2$ under metric $\rho$ that can
be packed into the space so that there is no overlap between any of the balls. It is worth noting that
the covering and packing numbers are (up to constant factors) essentially the same. In particular, the
inequalities $M(2\epsilon; S, \rho) \le N(\epsilon; S, \rho) \le M(\epsilon; S, \rho)$ are standard (e.g., [29]). Consequently,
given upper and lower bounds on the covering number, we can immediately infer similar upper and
lower bounds on the packing number. Of interest in our results is the logarithm of the covering number
$\log N(\epsilon; S, \rho)$, a quantity known as the metric entropy.
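To make the link between packings and coverings concrete, the following sketch builds a packing greedily on a small finite metric space and checks that, once no further point can be added, the selected points also form a covering at the same radius. The grid, the radius, and the function name `greedy_packing` are illustrative assumptions, and the packing is taken with strict separation.

```python
import numpy as np

def greedy_packing(points, delta):
    """Greedily select points whose pairwise distances all exceed delta."""
    chosen = []
    for p in points:
        if all(np.linalg.norm(p - c) > delta for c in chosen):
            chosen.append(p)
    return np.array(chosen)

# Finite metric space S: a 15 x 15 grid in the unit square with the Euclidean metric.
grid = np.linspace(0.0, 1.0, 15)
S = np.array([(a, b) for a in grid for b in grid])

delta = 0.3
packing = greedy_packing(S, delta)

# Maximality: every point of S lies within delta of some selected point,
# so the maximal packing is also a delta-covering of S.
dists = np.linalg.norm(S[:, None, :] - packing[None, :, :], axis=2)
is_covering = bool(np.all(dists.min(axis=1) <= delta))
print(len(packing), is_covering)
```

This is the mechanism behind the comparison of the two quantities: a packing that cannot be enlarged leaves no point of the space uncovered, so covering numbers are bounded by packing numbers at a comparable radius.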

A related quantity, frequently used in the operator theory literature [20, 30, 8], is the sequence of (dyadic)
entropy numbers $\epsilon_k(S; \rho)$, defined as follows for $k = 1, 2, \ldots$

$$\epsilon_k(S; \rho) = \inf\{\epsilon > 0 \mid N(\epsilon; S, \rho) \le 2^{k-1}\}.$$ (27)

By definition, note that we have $\epsilon_k(S; \rho) \le \delta$ if and only if $\log N(\delta; S, \rho) \le k$.

4.1.1 Metric entropies of $\ell_q$-balls

Central to our proofs is the metric entropy of the ball $\mathbb{B}_q(R_q)$ when the metric $\rho$ is the $\ell_p$-norm,
a quantity which we denote by $\log N_{p,q}(\epsilon)$. The following result, which provides upper and lower
bounds on this metric entropy that are tight up to constant factors, is an adaptation of results from the
operator theory literature [20, 17]; see Appendix B for the details. All bounds stated here apply to a
dimension $d \ge 2$.

Lemma 3. Assume that $q \in (0, 1]$ and $p \in [1, \infty]$ with $p > q$. Then there is a constant $U_{q,p}$, depending
only on $q$ and $p$, such that

$$\log N_{p,q}(\epsilon) \le U_{q,p} \left[ R_q^{\frac{p}{p-q}} \left(\frac{1}{\epsilon}\right)^{\frac{pq}{p-q}} \log d \right] \quad \text{for all } \epsilon \in (0, R_q^{1/q}).$$ (28)

Conversely, suppose in addition that $\epsilon < 1$ and $\epsilon^p = \Omega\Big(\big(\frac{\log d}{d^{\nu}}\big)^{\frac{p-q}{q}}\Big)$ for some fixed $\nu \in (0, 1)$, depending
only on $q$ and $p$. Then there is a constant $L_{q,p} \le U_{q,p}$, depending only on $q$ and $p$, such that

$$\log N_{p,q}(\epsilon) \ge L_{q,p} \left[ R_q^{\frac{p}{p-q}} \left(\frac{1}{\epsilon}\right)^{\frac{pq}{p-q}} \log d \right].$$ (29)

Remark: In our application of the lower bound (29), our typical choice of $\epsilon^p$ will be of the order
$\big(\frac{\log d}{n}\big)^{\frac{p-q}{2}}$. It can be verified that as long as there exists a $\kappa \in (0, 1)$ such that $\frac{d}{R_q n^{q/2}} = \Omega(d^{\kappa})$
(which is stated at the beginning of Section 2) and $p > q$, then there exists some fixed $\nu \in (0, 1)$,
depending only on $p$ and $q$, such that $\epsilon$ lies in the range required for the lower bound (29) to be valid.



4.1.2 Metric entropy of $q$-convex hulls

The proofs of the lower bounds all involve the Kullback-Leibler (KL) divergence between the distributions
induced by different parameters $\beta$ and $\beta'$ in $\mathbb{B}_q(R_q)$. Here we show that for the linear observation
model (1), these KL divergences can be represented as $q$-convex hulls of the columns of the design
matrix, and provide some bounds on the associated metric entropy.

For two distributions $\mathbb{P}$ and $\mathbb{Q}$ that have densities $d\mathbb{P}$ and $d\mathbb{Q}$ with respect to some base measure
$\mu$, the Kullback-Leibler (KL) divergence is given by $D(\mathbb{P} \,\|\, \mathbb{Q}) = \int \log \frac{d\mathbb{P}}{d\mathbb{Q}}\, \mathbb{P}(d\mu)$. We use $\mathbb{P}_\beta$ to
denote the distribution of $Y \in \mathbb{R}^n$ under the linear regression model; in particular, it corresponds to the
distribution of a $N(X\beta, \sigma^2 I_{n \times n})$ random vector. A straightforward computation then leads to

$$D(\mathbb{P}_\beta \,\|\, \mathbb{P}_{\beta'}) = \frac{1}{2\sigma^2} \|X\beta - X\beta'\|_2^2.$$ (30)
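The closed form for the KL divergence between two Gaussian regression models can be checked by simulation. The sketch below is only an illustration under assumed values (the dimensions, noise level, random design, and parameter vectors are all made up here): it compares the closed-form expression with a Monte Carlo average of the log-likelihood ratio under the first model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 20, 5, 1.5                       # illustrative problem sizes
X = rng.standard_normal((n, d))
beta, beta_prime = rng.standard_normal(d), rng.standard_normal(d)

diff = X @ (beta - beta_prime)
kl_closed_form = diff @ diff / (2 * sigma**2)  # ||X beta - X beta'||_2^2 / (2 sigma^2)

# Monte Carlo estimate of E_beta[log p_beta(Y) - log p_beta'(Y)] with Y = X beta + w.
num_samples = 100_000
w = sigma * rng.standard_normal((num_samples, n))
Y = X @ beta + w
log_ratio = (np.sum((Y - X @ beta_prime) ** 2, axis=1)
             - np.sum((Y - X @ beta) ** 2, axis=1)) / (2 * sigma**2)
kl_monte_carlo = log_ratio.mean()
std_error = log_ratio.std() / np.sqrt(num_samples)
print(kl_closed_form, kl_monte_carlo, std_error)
```

The two values should agree up to a few Monte Carlo standard errors, since the Gaussian normalizing constants cancel in the log-likelihood ratio and only the quadratic terms remain.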

Therefore, control of KL-divergences requires understanding of the metric entropy of the $q$-convex
hull of the rescaled columns of the design matrix $X$; in particular, the set

$$\mathrm{absconv}_q(X/\sqrt{n}) := \Big\{ \frac{1}{\sqrt{n}} \sum_{j=1}^{d} \theta_j X_j \;\Big|\; \theta \in \mathbb{B}_q(R_q) \Big\}.$$ (31)

We have introduced the normalization by $1/\sqrt{n}$ for later technical convenience.

Under the column normalization condition, it turns out that the metric entropy of this set with
respect to the $\ell_2$-norm is essentially no larger than the metric entropy of $\mathbb{B}_q(R_q)$, as summarized in
the following

Lemma 4. Suppose that $X$ satisfies the column normalization condition (Assumption 1 with constant
$\kappa_c$). Then there is a constant $U'_{q,2}$ depending only on $q \in (0, 1]$ such that

$$\log N(\epsilon, \mathrm{absconv}_q(X/\sqrt{n}), \|\cdot\|_2) \le U'_{q,2} \left[ R_q^{\frac{2}{2-q}} \left(\frac{\kappa_c}{\epsilon}\right)^{\frac{2q}{2-q}} \log d \right].$$

The proof of this claim is provided in Appendix C. Note that apart from a different constant, this
upper bound on the metric entropy is identical to that for $\log N_{2,q}(\epsilon/\kappa_c)$ from Lemma 3. Up to
constant factors, this upper bound cannot be tightened in general (e.g., consider $n = d$ and $X = I$).



4.2 Proof of lower bounds 

We begin by proving our main results that provide lower bounds on minimax risks, namely Theorems 1 
and 3. 



4.2.1 Proof of Theorem 1

Recall that the lower bounds in Theorem 1 are the maximum of two expressions, one corresponding
to the diameter of the set $\mathcal{N}_q(X)$ intersected with the $\ell_q$-ball, and the other corresponding to the metric
entropy of the $\ell_q$-ball.

We begin by deriving the lower bound based on the diameter of $\mathcal{N}_q(X) = \mathbb{B}_q(R_q) \cap \ker(X)$. The
minimax risk is lower bounded as

$$\min_{\hat{\beta}} \max_{\beta \in \mathbb{B}_q(R_q)} \mathbb{E}\|\hat{\beta} - \beta\|_p^p \ge \min_{\hat{\beta}} \max_{\beta \in \mathcal{N}_q(X)} \mathbb{E}\|\hat{\beta} - \beta\|_p^p,$$

where the inequality follows from the inclusion $\mathcal{N}_q(X) \subseteq \mathbb{B}_q(R_q)$. For any $\beta \in \mathcal{N}_q(X)$, we have $Y =
X\beta + w = w$, so that $Y$ contains no information about $\beta \in \mathcal{N}_q(X)$. Consequently, once $\hat{\beta}$ is chosen,
the adversary can always choose an element $\beta \in \mathcal{N}_q(X)$ such that $\|\hat{\beta} - \beta\|_p \ge \frac{1}{2} \mathrm{diam}_p(\mathcal{N}_q(X))$.
Indeed, if $\|\hat{\beta}\|_p \ge \frac{1}{2} \mathrm{diam}_p(\mathcal{N}_q(X))$, then the adversary chooses $\beta = 0 \in \mathcal{N}_q(X)$. On the other
hand, if $\|\hat{\beta}\|_p \le \frac{1}{2} \mathrm{diam}_p(\mathcal{N}_q(X))$, then the adversary can choose some $\beta \in \mathcal{N}_q(X)$ such that $\|\beta\|_p =
\mathrm{diam}_p(\mathcal{N}_q(X))$. By the triangle inequality, we then have $\|\beta - \hat{\beta}\|_p \ge \|\beta\|_p - \|\hat{\beta}\|_p \ge \frac{1}{2} \mathrm{diam}_p(\mathcal{N}_q(X))$.
Overall, we conclude that

$$\min_{\hat{\beta}} \max_{\beta \in \mathbb{B}_q(R_q)} \mathbb{E}\|\hat{\beta} - \beta\|_p^p \ge \Big( \frac{1}{2} \mathrm{diam}_p(\mathcal{N}_q(X)) \Big)^p.$$

In the following subsections, we establish the second terms in the lower bounds via the Fano method,
a standard approach for minimax lower bounds. Our proofs of parts (a) and (b) are based on slightly
different arguments.



Proof of Theorem 1(a): Let $M = M_p(\delta_n)$ be the cardinality of a maximal packing of the ball
$\mathbb{B}_q(R_q)$ in the $\ell_p$ metric, say with elements $\{\beta^1, \ldots, \beta^M\}$. A standard argument (e.g., [18, 34, 35])
yields a lower bound on the minimax $\ell_p$-risk in terms of the error in a multi-way hypothesis testing
problem: in particular, we have

$$\min_{\hat{\beta}} \max_{\beta \in \mathbb{B}_q(R_q)} \mathbb{E}\|\hat{\beta} - \beta\|_p^p \ge \frac{1}{2^p}\, \delta_n^p\, \min_{\tilde{\beta}} \mathbb{P}[B \ne \tilde{\beta}],$$

where the random vector $B \in \mathbb{R}^d$ is uniformly distributed over the packing set $\{\beta^1, \ldots, \beta^M\}$, and the
estimator $\tilde{\beta}$ takes values in the packing set. Applying Fano's inequality [11] yields the lower bound

$$\mathbb{P}[B \ne \tilde{\beta}] \ge 1 - \frac{I(B; Y) + \log 2}{\log M_p(\delta_n)},$$ (32)

where $I(B; Y)$ is the mutual information between the random parameter $B$ in the packing set and the
observation vector $Y \in \mathbb{R}^n$.

It remains to upper bound the mutual information; we do so by following the procedure of Yang
and Barron [34], which is based on covering the model space $\{\mathbb{P}_\beta, \beta \in \mathbb{B}_q(R_q)\}$ under the square-root
Kullback-Leibler divergence. As noted prior to Lemma 4, for the Gaussian models given here, this
square-root KL divergence takes the form $\sqrt{D(\mathbb{P}_\beta \,\|\, \mathbb{P}_{\beta'})} = \frac{1}{\sqrt{2\sigma^2}} \|X(\beta - \beta')\|_2$. Let $N = N_2(\epsilon_n)$ be
the minimal cardinality of an $\epsilon_n$-covering of $\mathbb{B}_q(R_q)$ in $\ell_2$-norm. Using the upper bound on the dyadic
entropy of $\mathrm{absconv}_q(X)$ provided by Lemma 4, we conclude that there exists a set $\{X\beta^1, \ldots, X\beta^N\}$
such that for all $X\beta \in \mathrm{absconv}_q(X)$, there exists some index $i$ such that $\|X(\beta - \beta^i)\|_2/\sqrt{n} \le c\, \kappa_c\, \epsilon_n$.
Following the argument of Yang and Barron [34], we obtain that the mutual information is upper
bounded as

$$I(B; Y) \le \log N_2(\epsilon_n) + \frac{c^2 n}{\sigma^2} \kappa_c^2 \epsilon_n^2.$$



Combining this upper bound with the Fano lower bound (32) yields

$$\mathbb{P}[B \ne \tilde{\beta}] \ge 1 - \frac{\log N_2(\epsilon_n) + \frac{c^2 n}{\sigma^2} \kappa_c^2 \epsilon_n^2 + \log 2}{\log M_p(\delta_n)}.$$ (33)

The final step is to choose the packing and covering radii ($\delta_n$ and $\epsilon_n$ respectively) such that the lower
bound (33) remains strictly above zero, say bounded below by 1/4. In order to do so, suppose that we
choose the pair $(\epsilon_n, \delta_n)$ such that

$$\frac{c^2 n}{\sigma^2} \kappa_c^2 \epsilon_n^2 \le \log N_2(\epsilon_n), \quad \text{and}$$ (34a)

$$\log M_p(\delta_n) \ge 4 \log N_2(\epsilon_n).$$ (34b)

As long as $N_2(\epsilon_n) \ge 2$, we are then guaranteed that

$$\mathbb{P}[B \ne \tilde{\beta}] \ge 1 - \frac{2 \log N_2(\epsilon_n) + \log 2}{4 \log N_2(\epsilon_n)} \ge 1/4,$$ (35)

as desired.
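The arithmetic behind this last step is simple enough to spell out in code. In the sketch below, the function names and the numerical inputs are assumptions made for illustration: the first function evaluates a Fano-type lower bound from a mutual information value and a log packing number, and the second one enforces the two sufficient conditions (the divergence term is at most the log covering number, and the log packing number is at least four times the log covering number) before evaluating it.

```python
import math

def fano_lower_bound(mutual_info, log_packing_number):
    """Fano-type lower bound on the error probability of an M-ary test."""
    return 1.0 - (mutual_info + math.log(2)) / log_packing_number

def bound_under_conditions(log_N, kl_term, log_M):
    """Evaluate the bound after checking the two sufficient conditions."""
    assert kl_term <= log_N          # divergence term dominated by the log covering number
    assert log_M >= 4 * log_N        # packing number large relative to covering number
    return fano_lower_bound(log_N + kl_term, log_M)

# Worst case allowed by the conditions when the covering number equals 2:
# both conditions hold with equality.
log_N = math.log(2)
worst = bound_under_conditions(log_N, kl_term=log_N, log_M=4 * log_N)

# A less extreme (made-up) configuration gives a larger error probability bound.
typical = bound_under_conditions(math.log(50), kl_term=1.0, log_M=6 * math.log(50))
print(worst, typical)
```

In the worst case the bound sits at the value 1/4 (up to floating-point rounding), which is why the covering number is required to be at least 2.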

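It remains to determine choices of $\epsilon_n$ and $\delta_n$ that satisfy the relations (34). From Lemma 3,
relation (34a) is satisfied by choosing $\epsilon_n$ such that $\frac{c^2 n}{\sigma^2} \kappa_c^2 \epsilon_n^2 = L_{q,2} \Big[ R_q^{\frac{2}{2-q}} \big(\frac{1}{\epsilon_n}\big)^{\frac{2q}{2-q}} \log d \Big]$, or equivalently
such that

$$(\epsilon_n)^{\frac{4}{2-q}} = \Theta\Big( R_q^{\frac{2}{2-q}}\, \frac{\sigma^2}{\kappa_c^2}\, \frac{\log d}{n} \Big).$$

In order to satisfy the bound (34b), it suffices to choose $\delta_n$ such that

$$U_{q,p} \left[ R_q^{\frac{p}{p-q}} \left(\frac{1}{\delta_n}\right)^{\frac{pq}{p-q}} \log d \right] \ge 4 L_{q,2} \left[ R_q^{\frac{2}{2-q}} \left(\frac{1}{\epsilon_n}\right)^{\frac{2q}{2-q}} \log d \right],$$

or equivalently (up to the constants implicit in the choice of $\epsilon_n$) such that

$$\delta_n^p \le \left[ \frac{U_{q,p}}{4 L_{q,2}} \right]^{\frac{p-q}{q}} R_q \left[ \frac{\sigma^2}{\kappa_c^2}\, \frac{\log d}{n} \right]^{\frac{p-q}{2}}.$$

Combining this bound with the lower bound (35) on the hypothesis testing error probability and
substituting into equation (10), we obtain

$$\min_{\hat{\beta}} \max_{\beta \in \mathbb{B}_q(R_q)} \mathbb{E}\|\hat{\beta} - \beta\|_p^p \ge c_{q,p}\, R_q \left[ \frac{\sigma^2}{\kappa_c^2}\, \frac{\log d}{n} \right]^{\frac{p-q}{2}},$$

which completes the proof of Theorem 1(a).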

Proof of Theorem 1(b): In order to prove Theorem 1(b), we require some definitions and an auxiliary
lemma. For any integer $s \in \{1, \ldots, d\}$, we define the set

$$\mathcal{H}(s) := \{ z \in \{-1, 0, +1\}^d \mid \|z\|_0 = s \}.$$

Although the set $\mathcal{H}$ depends on $s$, we frequently drop this dependence so as to simplify notation. We
define the Hamming distance $\rho_H(z, z') = \sum_{j=1}^{d} \mathbb{I}[z_j \ne z'_j]$ between the vectors $z$ and $z'$. We prove
the following result in Appendix D:

Lemma 5. There exists a subset $\tilde{\mathcal{H}} \subset \mathcal{H}$ with cardinality $|\tilde{\mathcal{H}}| \ge \exp\big(\frac{s}{2} \log \frac{d-s}{s/2}\big)$ such that $\rho_H(z, z') \ge \frac{s}{2}$
for all $z, z' \in \tilde{\mathcal{H}}$.

Now consider a rescaled version of the set $\tilde{\mathcal{H}}$, say $\sqrt{\frac{2}{s}}\, \delta_n \tilde{\mathcal{H}}$ for some $\delta_n > 0$ to be chosen. For any
elements $\beta, \beta' \in \sqrt{\frac{2}{s}}\, \delta_n \tilde{\mathcal{H}}$, we have the following bounds on the $\ell_2$-norm of their difference:

$$\|\beta - \beta'\|_2^2 \ge \delta_n^2, \quad \text{and}$$ (36a)

$$\|\beta - \beta'\|_2^2 \le 8 \delta_n^2.$$ (36b)

Consequently, the rescaled set $\sqrt{\frac{2}{s}}\, \delta_n \tilde{\mathcal{H}}$ is a $\delta_n$-packing set in $\ell_2$-norm with $M_2(\delta_n) = |\tilde{\mathcal{H}}|$ elements,
say $\{\beta^1, \ldots, \beta^M\}$. Using this packing set, we now follow the same classical steps as in the proof of
Theorem 1(a), up until the Fano lower bound (32).
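For small dimensions the packing construction can be carried out explicitly. The sketch below is an illustration only: the sizes $d = 7$, $s = 4$, the radius, and the greedy selection rule are assumptions made here (the existence proof in the paper's appendix need not proceed greedily). It enumerates the sparse sign vectors, keeps a subset with pairwise Hamming distance at least $s/2$, rescales by $\sqrt{2/s}$ times the radius, and reports the extreme squared $\ell_2$-distances in units of the squared radius.

```python
import itertools
import numpy as np

def sparse_sign_vectors(d, s):
    """Enumerate vectors in {-1, 0, +1}^d with exactly s non-zero entries."""
    for support in itertools.combinations(range(d), s):
        for signs in itertools.product((-1, 1), repeat=s):
            z = np.zeros(d, dtype=int)
            z[list(support)] = signs
            yield z

def greedy_hamming_packing(vectors, min_dist):
    """Keep vectors whose Hamming distance to all kept vectors is >= min_dist."""
    kept = []
    for z in vectors:
        if all(np.count_nonzero(z != k) >= min_dist for k in kept):
            kept.append(z)
    return kept

d, s, delta_n = 7, 4, 0.1                      # illustrative sizes and radius
H_tilde = greedy_hamming_packing(sparse_sign_vectors(d, s), min_dist=s / 2)
betas = np.sqrt(2.0 / s) * delta_n * np.array(H_tilde, dtype=float)

sq_dists = [np.sum((b1 - b2) ** 2) for b1, b2 in itertools.combinations(betas, 2)]
ratio_min = min(sq_dists) / delta_n**2         # should be at least 1
ratio_max = max(sq_dists) / delta_n**2         # should be at most 8
print(len(H_tilde), ratio_min, ratio_max)
```

The lower ratio reflects the Hamming separation (each differing coordinate contributes at least one unit before rescaling), while the upper ratio reflects the fact that the difference of two $s$-sparse sign vectors has squared norm at most $4s$.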

At this point, we use an alternative upper bound on the mutual information, namely the bound
$I(Y; B) \le \frac{1}{\binom{M}{2}} \sum_{i \ne j} D(\beta^i \,\|\, \beta^j)$, which follows from the convexity of mutual information [11]. For
the linear observation model (1), we have $D(\beta^i \,\|\, \beta^j) = \frac{1}{2\sigma^2} \|X(\beta^i - \beta^j)\|_2^2$. Since $(\beta - \beta') \in \mathbb{B}_0(2s)$
by construction, from the assumptions on $X$ and the upper bound (36b), we conclude that

$$I(Y; B) \le \frac{8 n \kappa_u^2 \delta_n^2}{2\sigma^2}.$$

Substituting this upper bound into the Fano lower bound (32), we obtain

$$\mathbb{P}[B \ne \tilde{\beta}] \ge 1 - \frac{\frac{8 n \kappa_u^2 \delta_n^2}{2\sigma^2} + \log 2}{\frac{s}{2} \log \frac{d-s}{s/2}}.$$

Setting $\delta_n^2 = \frac{1}{32} \frac{\sigma^2}{\kappa_u^2} \frac{s}{2n} \log \frac{d-s}{s/2}$ ensures that this probability is at least 1/4. Consequently, combined
with the lower bound (10), we conclude that

$$\min_{\hat{\beta}} \max_{\beta \in \mathbb{B}_0(s)} \mathbb{E}\|\hat{\beta} - \beta\|_p^p \ge \frac{1}{2^p}\, \frac{1}{4} \left[ \frac{1}{32}\, \frac{\sigma^2}{\kappa_u^2}\, \frac{s}{2n} \log \frac{d-s}{s/2} \right]^{p/2}.$$

As long as the ratio $d/s \ge 1 + \delta$ for some $\delta > 0$, we have $\log(d/s - 1) \ge c \log(d/s)$ for some constant
$c > 0$, from which the result follows.

4.2.2 Proof of Theorem 3 

We use arguments similar to the proof of Theorem 1 in order to establish lower bounds on prediction
error $\|X(\hat{\beta} - \beta^*)\|_2/\sqrt{n}$.

Proof of Theorem 3(a): For some $\delta_n^2 = \Omega\big(R_q (\frac{\log d}{n})^{1 - q/2}\big)$, let $\{\beta^1, \ldots, \beta^M\}$ be a $\delta_n$-packing of
the ball $\mathbb{B}_q(R_q)$ in the $\ell_2$ metric, say with a total of $M = M(\delta_n/\kappa_c)$ elements. We first show that if
$n$ is sufficiently large, then this set is also a $\kappa_\ell \delta_n/2$-packing set in the prediction (semi)-norm. From
Assumption 2, for each $i \ne j$,

$$\frac{\|X(\beta^i - \beta^j)\|_2}{\sqrt{n}} \ge \kappa_\ell \|\beta^i - \beta^j\|_2 - f_\ell(R_q, n, d).$$ (37)

Using the assumed lower bound on $\delta_n^2$, namely $\delta_n^2 = \Omega\big(R_q (\frac{\log d}{n})^{1 - q/2}\big)$, and the initial lower
bound (37), we conclude that $\frac{\|X(\beta^i - \beta^j)\|_2}{\sqrt{n}} \ge \kappa_\ell \delta_n/2$ once $n$ is larger than some finite number.

We have thus constructed a $\kappa_\ell \delta_n/2$-packing set in the (semi)-norm $\|X(\beta^i - \beta^j)\|_2$. As in the
proof of Theorem 1(a), we follow a standard approach to reduce the problem of lower bounding the
minimax error to the error probability of a multi-way hypothesis testing problem. After this step, we
apply the Fano inequality to lower bound this error probability via

$$\mathbb{P}[XB \ne X\tilde{\beta}] \ge 1 - \frac{I(XB; Y) + \log 2}{\log M_2(\delta_n)},$$

where $I(XB; Y)$ now represents the mutual information³ between the random parameter $XB$ (uniformly
distributed over the packing set) and the observation vector $Y \in \mathbb{R}^n$.

From Lemma 4, the $\kappa_c \epsilon$-covering number of the set $\mathrm{absconv}_q(X)$ is upper bounded (up to a constant
factor) by the $\epsilon$-covering number of $\mathbb{B}_q(R_q)$ in $\ell_2$-norm, which we denote by $N_2(\epsilon_n)$. Following
the same reasoning as in Theorem 1(a), the mutual information is upper bounded as

$$I(XB; Y) \le \log N_2(\epsilon_n) + \frac{c^2 n}{\sigma^2} \kappa_c^2 \epsilon_n^2.$$

Combined with the Fano lower bound, we obtain

$$\mathbb{P}[XB \ne X\tilde{\beta}] \ge 1 - \frac{\log N_2(\epsilon_n) + \frac{c^2 n}{\sigma^2} \kappa_c^2 \epsilon_n^2 + \log 2}{\log M_2(\delta_n)}.$$ (38)

Lastly, we choose the packing and covering radii ($\delta_n$ and $\epsilon_n$ respectively) such that the lower bound (38)
remains strictly above zero, say bounded below by 1/4. It suffices to choose the pair $(\epsilon_n, \delta_n)$ to satisfy
the relations (34a) and (34b). As long as $\epsilon_n^2 > 0$ and $N_2(\epsilon_n) \ge 2$, we are then guaranteed that

$$\mathbb{P}[XB \ne X\tilde{\beta}] \ge 1 - \frac{2 \log N_2(\epsilon_n) + \log 2}{4 \log N_2(\epsilon_n)} \ge 1/4,$$

as desired. Recalling that we have constructed a $\delta_n \kappa_\ell/2$-packing in the prediction (semi)-norm, we
obtain

$$\min_{\hat{\beta}} \max_{\beta \in \mathbb{B}_q(R_q)} \mathbb{E}\|X(\hat{\beta} - \beta)\|_2^2/n \ge c'_{2,q}\, R_q\, \kappa_\ell^2 \left[ \frac{\sigma^2}{\kappa_c^2}\, \frac{\log d}{n} \right]^{1 - q/2}$$

for some constant $c'_{2,q} > 0$. This completes the proof of Theorem 3(a).

³ Despite the difference in notation, this mutual information is the same as $I(B; Y)$, since it measures the information
between the observation vector $Y$ and the discrete index $i$.

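Proof of Theorem 3(b): Recall the assertion of Lemma 5, which guarantees the existence of a set
$\sqrt{\frac{2}{s}}\, \delta_n \tilde{\mathcal{H}}$ that is a $\delta_n$-packing set in $\ell_2$-norm with $M_2(\delta_n) = |\tilde{\mathcal{H}}|$ elements, say $\{\beta^1, \ldots, \beta^M\}$, such that
the bounds (36a) and (36b) hold, and such that $\log |\tilde{\mathcal{H}}| \ge \frac{s}{2} \log \frac{d-s}{s/2}$. By construction, the difference
vectors $(\beta^i - \beta^j) \in \mathbb{B}_0(2s)$, so that by assumption, we have

$$\|X(\beta^i - \beta^j)\|_2/\sqrt{n} \le \kappa_u \|\beta^i - \beta^j\|_2 \le \kappa_u \sqrt{8}\, \delta_n.$$ (39)

In the reverse direction, since Assumption 2 holds with $f_\ell(R_q, n, d) = 0$, we have

$$\|X(\beta^i - \beta^j)\|_2/\sqrt{n} \ge \kappa_\ell\, \delta_n.$$ (40)

We can follow the same steps as in the proof of Theorem 1(b), thereby obtaining an upper bound on the
mutual information of the form $I(XB; Y) \le \frac{8 \kappa_u^2 n \delta_n^2}{2\sigma^2}$. Combined with the Fano lower bound, we have

$$\mathbb{P}[XB \ne X\tilde{\beta}] \ge 1 - \frac{\frac{8 n \kappa_u^2 \delta_n^2}{2\sigma^2} + \log 2}{\frac{s}{2} \log \frac{d-s}{s/2}}.$$

Remembering the extra factor of $\kappa_\ell$ from the lower bound (40), we obtain the lower bound

$$\min_{\hat{\beta}} \max_{\beta \in \mathbb{B}_0(s)} \mathbb{E} \frac{1}{n} \|X(\hat{\beta} - \beta)\|_2^2 \ge c'_{0,q}\, \kappa_\ell^2\, \frac{\sigma^2}{\kappa_u^2}\, \frac{s}{n} \log \frac{d-s}{s/2}.$$

Repeating the argument from the proof of Theorem 1(b) allows us to further lower bound this quantity
in terms of $\log(d/s)$, leading to the claimed form of the bound.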

4.3 Proof of achievability results 

We now turn to the proofs of our main achievability results, namely Theorems 2 and 4, that provide
upper bounds on minimax risks. We prove all parts of these theorems by analyzing the family of
M-estimators

$$\hat{\beta} \in \arg\min_{\|\beta\|_q^q \le R_q} \|Y - X\beta\|_2^2.$$

We begin by deriving an elementary inequality that is useful throughout the analysis. Since the
vector $\beta^*$ satisfies the constraint $\|\beta^*\|_q^q \le R_q$, meaning $\beta^*$ is a feasible point, we have $\|Y - X\hat{\beta}\|_2^2 \le
\|Y - X\beta^*\|_2^2$. Defining $\hat{\Delta} = \hat{\beta} - \beta^*$ and performing some algebra, we obtain the inequality

$$\frac{1}{n} \|X\hat{\Delta}\|_2^2 \le \frac{2 |w^T X \hat{\Delta}|}{n}.$$ (41)
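The elementary inequality only uses the fact that the estimator fits the data at least as well as the true (feasible) parameter, so it can be checked on simulated data. The sketch below takes the case of hard sparsity and computes the constrained least-squares estimate by exhaustive search over supports; the problem sizes, the random design, the noise level, and the true vector are all assumptions made here for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d, s, sigma = 30, 8, 2, 0.5                 # illustrative problem sizes
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:s] = [1.0, -2.0]                    # an s-sparse truth, hence feasible
w = sigma * rng.standard_normal(n)
Y = X @ beta_star + w

# Least squares over s-sparse vectors: exhaustive search over supports of size s.
best_rss, beta_hat = np.inf, None
for support in itertools.combinations(range(d), s):
    cols = list(support)
    coef, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
    rss = np.sum((Y - X[:, cols] @ coef) ** 2)
    if rss < best_rss:
        best_rss = rss
        beta_hat = np.zeros(d)
        beta_hat[cols] = coef

delta_hat = beta_hat - beta_star
lhs = np.sum((X @ delta_hat) ** 2) / n         # (1/n) ||X Delta||_2^2
rhs = 2 * abs(w @ X @ delta_hat) / n           # 2 |w^T X Delta| / n
print(lhs, rhs)
```

Expanding $\|w - X\hat{\Delta}\|_2^2 \le \|w\|_2^2$ gives $\|X\hat{\Delta}\|_2^2 \le 2 w^T X \hat{\Delta}$, which is what the comparison of `lhs` and `rhs` verifies; the exhaustive search is feasible only for very small $d$.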

4.3.1 Proof of Theorem 2 

We begin with the proof of Theorem 2, in which we upper bound the minimax risk in squared $\ell_2$-norm.

Proof of Theorem 2(a): To begin, we may apply Assumption 2 to the inequality (41) to obtain

$$\big[ \max(0, \kappa_\ell \|\hat{\Delta}\|_2 - f_\ell(R_q, n, d)) \big]^2 \le \frac{2 |w^T X \hat{\Delta}|}{n} \le \frac{2}{n} \|w^T X\|_\infty \|\hat{\Delta}\|_1.$$

Since $w_i \sim N(0, \sigma^2)$ and the columns of $X$ are normalized, each entry of $\frac{2}{n} w^T X$ is zero-mean
Gaussian with variance at most $4\sigma^2 \kappa_c^2/n$. Therefore, by union bound and standard Gaussian tail
bounds, we obtain that the inequality

$$\big[ \max(0, \kappa_\ell \|\hat{\Delta}\|_2 - f_\ell(R_q, n, d)) \big]^2 \le 2\sigma \kappa_c \sqrt{\frac{3 \log d}{n}}\, \|\hat{\Delta}\|_1$$ (42)

holds with probability greater than $1 - c_1 \exp(-c_2 n)$. Consequently, we may conclude that at least
one of the two following alternatives must hold:

$$\|\hat{\Delta}\|_2 \le \frac{2 f_\ell(R_q, n, d)}{\kappa_\ell}, \quad \text{or}$$ (43a)

$$\|\hat{\Delta}\|_2^2 \le \frac{2\sigma \kappa_c}{\kappa_\ell^2} \sqrt{\frac{3 \log d}{n}}\, \|\hat{\Delta}\|_1.$$ (43b)

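Suppose first that alternative (43a) holds. Since Assumption 2 requires $f_\ell(R_q, n, d) = o\big(R_q^{1/2} (\frac{\log d}{n})^{1/2 - q/4}\big)$, we then have

$$\|\hat{\Delta}\|_2^2 \le o\Big( R_q \Big(\frac{\log d}{n}\Big)^{1 - q/2} \Big),$$

which is of no larger order than the rate claimed in Theorem 2(a).

On the other hand, suppose that alternative (43b) holds. Since both $\hat{\beta}$ and $\beta^*$ belong to $\mathbb{B}_q(R_q)$, we
have $\|\hat{\Delta}\|_q^q = \sum_{j=1}^{d} |\hat{\Delta}_j|^q \le 2R_q$. Therefore we can exploit Lemma 2 by setting $\tau = \frac{2\sigma \kappa_c}{\kappa_\ell^2} \sqrt{\frac{3 \log d}{n}}$,
thereby obtaining the bound $\|\hat{\Delta}\|_2^2 \le \tau \|\hat{\Delta}\|_1$, and hence

$$\|\hat{\Delta}\|_2^2 \le \sqrt{2R_q}\, \tau^{1 - q/2}\, \|\hat{\Delta}\|_2 + 2R_q\, \tau^{2 - q}.$$

Viewed as a quadratic in the indeterminate $x = \|\hat{\Delta}\|_2$, this inequality is equivalent to the constraint
$f(x) = ax^2 + bx + c \le 0$, with $a = 1$,

$$b = -\sqrt{2R_q}\, \tau^{1 - q/2}, \quad \text{and} \quad c = -2R_q\, \tau^{2 - q}.$$

Since $f(0) = c < 0$ and the positive root of $f(x)$ occurs at $x^* = (-b + \sqrt{b^2 - 4ac})/(2a)$, some
algebra shows that we must have

$$\|\hat{\Delta}\|_2^2 \le 4 \max\{b^2, |c|\} \le 24 R_q \left[ \frac{\kappa_c^2}{\kappa_\ell^2}\, \frac{\sigma^2}{\kappa_\ell^2}\, \frac{\log d}{n} \right]^{1 - q/2}$$

with high probability (as stated in Theorem 2(a)), which completes the proof of Theorem 2(a).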
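The "some algebra" step, which bounds the squared positive root of the quadratic by four times the larger of $b^2$ and $|c|$, can be checked numerically. The sketch below assumes the coefficients $b = -\sqrt{2R_q}\,\tau^{1-q/2}$ and $c = -2R_q\,\tau^{2-q}$ with $a = 1$; the grid of values for $R_q$, $\tau$, and $q$, and the function names, are illustrative choices made here.

```python
import math

def positive_root(b, c):
    """Largest root of x^2 + b x + c when c < 0 (the two roots have opposite signs)."""
    return (-b + math.sqrt(b * b - 4.0 * c)) / 2.0

def check_root_bound(R_q, tau, q):
    """Check x*^2 <= 4 max{b^2, |c|} for the quadratic with the assumed coefficients."""
    b = -math.sqrt(2.0 * R_q) * tau ** (1.0 - q / 2.0)
    c = -2.0 * R_q * tau ** (2.0 - q)
    x_star = positive_root(b, c)
    return x_star ** 2 <= 4.0 * max(b * b, abs(c)) + 1e-15

root_bound_holds = all(
    check_root_bound(R_q, tau, q)
    for R_q in (0.1, 1.0, 10.0)
    for tau in (1e-3, 1e-2, 1e-1, 1.0)
    for q in (0.25, 0.5, 1.0)
)
print(root_bound_holds)
```

The bound follows in general from $x^* \le |b| + \sqrt{|c|} \le 2 \max\{|b|, \sqrt{|c|}\}$, which the grid check above simply confirms on a handful of values.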

Proof of Theorem 2(b): In order to establish the bound (13), we follow the same steps with $f_\ell(s, n, d) =
0$, thereby obtaining the following simplified form of the bound (42):

$$\|\hat{\Delta}\|_2^2 \le \frac{2\sigma \kappa_c}{\kappa_\ell^2} \sqrt{\frac{3 \log d}{n}}\, \|\hat{\Delta}\|_1.$$

By definition of the estimator, we have $\|\hat{\Delta}\|_0 \le 2s$, from which we obtain $\|\hat{\Delta}\|_1 \le \sqrt{2s}\, \|\hat{\Delta}\|_2$.
Canceling out a factor of $\|\hat{\Delta}\|_2$ from both sides yields the claim (13).

Establishing the sharper upper bound (14) requires more precise control on the right-hand side of
the inequality (41). The following lemma, proved in Appendix F, provides this control:

Lemma 6. If $\frac{\|X\theta\|_2}{\sqrt{n}\, \|\theta\|_2} \le \kappa_u$ for all $\theta \in \mathbb{B}_0(2s)$, then for any $r > 0$, we have

$$\sup_{\|\theta\|_0 \le 2s,\; \|\theta\|_2 \le r} \frac{1}{n} |w^T X \theta| \le 6\, \sigma\, r\, \kappa_u \sqrt{\frac{s \log(d/s)}{n}}$$ (44)

with probability greater than $1 - c_1 \exp(-c_2 \min\{n, s \log(d - s)\})$.

Let us apply this lemma to the basic inequality (41). We may upper bound the right-hand side as

$$\frac{1}{n} |w^T X \hat{\Delta}| \le \|\hat{\Delta}\|_2 \sup_{\|\theta\|_0 \le 2s,\; \|\theta\|_2 \le 1} \frac{1}{n} |w^T X \theta| \le 6\, \|\hat{\Delta}\|_2\, \sigma\, \kappa_u \sqrt{\frac{s \log(d/s)}{n}}.$$

Consequently, we have

$$\frac{1}{n} \|X\hat{\Delta}\|_2^2 \le 12\, \sigma\, \|\hat{\Delta}\|_2\, \kappa_u \sqrt{\frac{s \log(d/s)}{n}},$$

with high probability. By Assumption 2, we have $\|X\hat{\Delta}\|_2^2/n \ge \kappa_\ell^2 \|\hat{\Delta}\|_2^2$. Cancelling out a factor of
$\|\hat{\Delta}\|_2$ and re-arranging yields $\|\hat{\Delta}\|_2 \le 12\, \frac{\kappa_u \sigma}{\kappa_\ell^2} \sqrt{\frac{s \log(d/s)}{n}}$ with high probability as claimed.

4.3.2 Proof of Theorem 4 

We again make use of the elementary inequality (41) to establish upper bounds on the prediction risk.

Proof of Theorem 4(a): So as to facilitate tracking of constants in this part of the proof, we consider
the rescaled observation model, in which $\tilde{w} \sim N(0, I_n)$ and $\tilde{X} := \sigma^{-1} X$. Note that if $X$ satisfies
Assumption 1 with constant $\kappa_c$, then $\tilde{X}$ satisfies it with constant $\tilde{\kappa}_c = \kappa_c/\sigma$. Moreover, if we establish
a bound on $\|\tilde{X}(\hat{\beta} - \beta^*)\|_2^2/n$, then multiplying by $\sigma^2$ recovers a bound on the original prediction loss.
We first deal with the case $q = 1$. In particular, we have

$$\Big| \frac{1}{n} \tilde{w}^T \tilde{X} \theta \Big| \le \Big\| \frac{\tilde{w}^T \tilde{X}}{n} \Big\|_\infty \|\theta\|_1 \le \sqrt{\frac{3 \tilde{\kappa}_c^2 \log d}{n}}\, (2 R_1),$$

where the second inequality holds with probability $1 - c_1 \exp(-c_2 \log d)$, using standard Gaussian
tail bounds. (In particular, since $\|\tilde{X}_i\|_2/\sqrt{n} \le \tilde{\kappa}_c$, the variate $\tilde{w}^T \tilde{X}_i/n$ is zero-mean Gaussian with
variance at most $\tilde{\kappa}_c^2/n$.) This completes the proof for $q = 1$.

Turning to the case $q \in (0, 1)$, in order to establish upper bounds over $\mathbb{B}_q(2R_q)$, we require the
following analog of Lemma 6, proved in Appendix G.1. So as to lighten notation, let us introduce the
shorthand $g(R_q, n, d) := \sqrt{R_q}\, \big(\frac{\log d}{n}\big)^{\frac{1}{2} - \frac{q}{4}}$.

Lemma 7. For q £ (0, 1), suppose that g(R q , n, d) = o(l) and d = f2(n). Then for any fixed radius 

q 

r such that r > CsK c 2 g(R g ,n, d)for some numerical constant C3 > 0, we have 

1 \~Tvn\ ~ 2 frT /1°§ d l_q 



\w T X9\ < c 4 r k c ? s/R~ q {-^-) 2^4 ; 



sup 

ii?fliio ti ' ' ' ' ' n 

e& q (2R q ), ^^<r 

with probability greater than 1 — c\ exp(— C2 n g 2 (R q , n, d)). 
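Note that Lemma 7 above holds for any fixed radius $r \ge c_3\,\widetilde{\kappa}_c^{q/2}\,g(R_q, n, d)$. We would like to
apply the result of Lemma 7 to $r = \frac{\|\widetilde{X}\Delta\|_2}{\sqrt{n}}$, which is a random quantity. In Appendix H, we state
and prove a "peeling" result that allows us to strengthen Lemma 7 in a way suitable for our needs. In
particular, if we define the event

$$\mathcal{E} := \Big\{\exists\,\theta \in \mathbb{B}_q(2R_q) \text{ such that } \frac{1}{n}\big|\widetilde{w}^T\widetilde{X}\theta\big| \ge c_4\,\frac{\|\widetilde{X}\theta\|_2}{\sqrt{n}}\,\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,\Big(\frac{\log d}{n}\Big)^{\frac{1}{2}-\frac{q}{4}}\Big\}, \tag{45}$$

then we claim that

$$\mathbb{P}[\mathcal{E}] \;\le\; \frac{2\exp(-c\,n\,g^2(R_q, n, d))}{1 - \exp(-c\,n\,g^2(R_q, n, d))}.$$

This claim follows from Lemma 9 in Appendix H by making the choices $f_n(v; X_n) = \frac{1}{n}|\widetilde{w}^T\widetilde{X}v|$,
$\rho(v) = \frac{\|\widetilde{X}v\|_2}{\sqrt{n}}$, and $g(r) = c_3\, r\,\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,\big(\frac{\log d}{n}\big)^{\frac{1}{2}-\frac{q}{4}}$.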

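Returning to the main thread, from the basic inequality (41), on the complement of the event $\mathcal{E}$ from
equation (45), we have, for a numerical constant $c$,

$$\frac{\|\widetilde{X}\Delta\|_2^2}{n} \;\le\; c\,\frac{\|\widetilde{X}\Delta\|_2}{\sqrt{n}}\;\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,\Big(\frac{\log d}{n}\Big)^{\frac{1}{2}-\frac{q}{4}}.$$

Canceling out a factor of $\frac{\|\widetilde{X}\Delta\|_2}{\sqrt{n}}$, squaring both sides, multiplying by $\sigma^2$ and simplifying yields

$$\frac{\|X\Delta\|_2^2}{n} \;\le\; c^2\,\sigma^2\Big(\frac{\kappa_c}{\sigma}\Big)^{q} R_q\Big(\frac{\log d}{n}\Big)^{1-\frac{q}{2}} \;=\; c^2\,\kappa_c^2\,R_q\Big(\frac{\sigma^2}{\kappa_c^2}\,\frac{\log d}{n}\Big)^{1-\frac{q}{2}},$$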

as claimed. 
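The final simplification rests on the identity $\sigma^2(\kappa_c/\sigma)^q = \kappa_c^2(\sigma^2/\kappa_c^2)^{1-q/2}$. The following minimal sketch checks it numerically; the function names and the sample parameter values are illustrative assumptions, not part of the paper.

```python
import math

def rate_rescaled(sigma, kappa_c, R_q, q, n, d):
    """sigma^2 * (kappa_c / sigma)^q * R_q * (log d / n)^(1 - q/2)."""
    return sigma ** 2 * (kappa_c / sigma) ** q * R_q * (math.log(d) / n) ** (1 - q / 2)

def rate_original(sigma, kappa_c, R_q, q, n, d):
    """kappa_c^2 * R_q * (sigma^2 * log d / (kappa_c^2 * n))^(1 - q/2)."""
    inner = sigma ** 2 * math.log(d) / (kappa_c ** 2 * n)
    return kappa_c ** 2 * R_q * inner ** (1 - q / 2)

# Hypothetical parameter values, chosen only to exercise both expressions.
a = rate_rescaled(sigma=0.7, kappa_c=1.3, R_q=2.0, q=0.5, n=200, d=1000)
b = rate_original(sigma=0.7, kappa_c=1.3, R_q=2.0, q=0.5, n=200, d=1000)
print(math.isclose(a, b, rel_tol=1e-12))
```

Both forms therefore scale as $\kappa_c^{q}\sigma^{2-q}R_q(\log d/n)^{1-q/2}$.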

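Proof of Theorem 4(b): For this part, we require the following lemma, proven in Appendix G.2:

Lemma 8. Suppose that $\frac{d}{2s} \ge 2$. Then for any $r > 0$, we have

$$\sup_{\theta \in \mathbb{B}_0(2s),\ \frac{\|X\theta\|_2}{\sqrt{n}} \le r} \frac{1}{n}\big|w^T X\theta\big| \;\le\; 9\, r\,\sigma\sqrt{\frac{s\log(d/s)}{n}}$$

with probability greater than $1 - \exp\big(-10\, s\log(\frac{d}{2s})\big)$.

Consequently, combining this result with the basic inequality (41), we conclude that, for a numerical constant $c$,

$$\frac{\|X\Delta\|_2^2}{n} \;\le\; c\,\frac{\|X\Delta\|_2}{\sqrt{n}}\;\sigma\sqrt{\frac{s\log(d/s)}{n}}$$

with high probability, from which the result follows.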

5 Discussion 

The main contribution of this paper was to analyze minimax rates of convergence for the linear
model (1) under high-dimensional scaling, in which the sample size $n$ and problem dimension $d$
tend to infinity. We provided lower bounds for the $\ell_p$-norm for all $p \in [1, \infty]$ with $p \ne q$, as well
as for the $\ell_2$-prediction loss. In addition, for both the $\ell_2$-loss and $\ell_2$-prediction loss, we derived a set
of upper bounds that match our lower bounds up to constant factors, so that the minimax rates are
exactly determined in these cases. The rates may be viewed as an extension of the rates for the case
of $\ell_2$-loss from Donoho and Johnstone [14] on the Gaussian sequence model to more general design
matrices $X$. In particular, substituting $X = I$ and $d = n$ into Theorems 1 and 2 yields the same rates
as those expressed in Donoho and Johnstone [14] (see Corollary 1), although they provided much
sharper control of the constant pre-factors than the analysis given here.

Apart from the rates themselves, our analysis highlights how conditions on the design matrix $X$
enter in complementary manners for different loss functions. On one hand, it is possible to obtain
lower bounds on $\ell_p$-risk (see Theorem 1) or upper bounds on $\ell_2$-prediction risk (see Theorem 4) under
very mild assumptions on $X$: in particular, our analysis requires only that the columns of $X/\sqrt{n}$
have bounded $\ell_2$-norms (see, in particular, Assumption 1). On the other hand, in order to obtain
upper bounds on $\ell_2$-risk (Theorem 2) or lower bounds on $\ell_2$-norm prediction risk (Theorem 3), the
design matrix $X$ must satisfy, in addition to column normalization, other more restrictive conditions.
In particular, our analysis was based on imposing a certain type of lower bound on the curvature
of $X^T X$ measured over the $\ell_q$-ball (see Assumption 2). As shown in Lemma 1, this lower bound is
intimately related to the degree of non-identifiability over the $\ell_q$-ball of the high-dimensional linear
regression model.

In addition, we showed that Assumption 2 is not unreasonable: in particular, it is satisfied with
high probability for broad classes of Gaussian random matrices, in which each row is drawn in an i.i.d.
manner from a $N(0, \Sigma)$ distribution (see Proposition 1). This result applies to Gaussian ensembles
with much richer structure than the standard Gaussian case ($\Sigma = I_{d \times d}$). Finally, we compared to the
weakest known sufficient conditions for $\ell_1$-based relaxations to be consistent in $\ell_2$-norm for $q = 0$,
namely the restricted eigenvalue (RE) condition of Bickel et al. [4], and showed that the oracle
least-squares over the $\ell_0$-ball method can succeed with even milder conditions on the design. In addition, we
also proved that the RE condition holds with high probability for broad classes of Gaussian random
matrices, as long as the covariance matrix $\Sigma$ is not degenerate. The analysis highlights how the
structure of $X$ determines whether $\ell_1$-based relaxations achieve the minimax optimal rate.

The results and analysis from our paper can be extended in a number of ways. First, the assumption
of independent Gaussian noise is somewhat restrictive, and it would be interesting to analyze the
model under different noise assumptions, either noise with heavier tails or some degree of dependency.
In addition, we are currently working on extending our analysis to non-parametric sparse additive
models.
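A Proof of Lemma 2

Defining the set $S = \{j \mid |\theta_j| > \tau\}$, we have

$$\|\theta\|_1 = \|\theta_S\|_1 + \sum_{j \notin S}|\theta_j| \;\le\; \sqrt{|S|}\,\|\theta\|_2 + \tau\sum_{j \notin S}\frac{|\theta_j|}{\tau}.$$

Since $|\theta_j|/\tau \le 1$ for all $j \notin S$, we obtain

$$\|\theta\|_1 \;\le\; \sqrt{|S|}\,\|\theta\|_2 + \tau\sum_{j \notin S}\big(|\theta_j|/\tau\big)^q.$$

Finally, we observe $2R_q \ge \sum_{j \in S}|\theta_j|^q \ge |S|\,\tau^q$, from which the result follows.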
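As a sanity check on the two inequalities used in this proof, the sketch below evaluates both sides on a randomly generated vector. The helper names, the random test vector, and the chosen thresholds are assumptions made for illustration only.

```python
import math
import random

def lemma2_sides(theta, tau, q):
    """Return (||theta||_1, sqrt(|S|)*||theta||_2 + tau*sum_{j not in S}(|theta_j|/tau)^q)."""
    big = [t for t in theta if abs(t) > tau]      # coordinates in S
    small = [t for t in theta if abs(t) <= tau]   # coordinates outside S
    l1 = sum(abs(t) for t in theta)
    l2 = math.sqrt(sum(t * t for t in theta))
    rhs = math.sqrt(len(big)) * l2 + tau * sum((abs(t) / tau) ** q for t in small)
    return l1, rhs

def support_size_bound(theta, tau, q):
    """Return (|S| * tau^q, sum_{j in S} |theta_j|^q); the first never exceeds the second."""
    big = [t for t in theta if abs(t) > tau]
    return len(big) * tau ** q, sum(abs(t) ** q for t in big)

rng = random.Random(0)  # fixed seed; the data are purely illustrative
theta = [rng.gauss(0.0, 1.0) * rng.random() ** 4 for _ in range(50)]
for tau in (0.01, 0.1, 0.5):
    lhs, rhs = lemma2_sides(theta, tau, q=0.5)
    print(f"tau={tau}: ||theta||_1 = {lhs:.4f} <= {rhs:.4f}")
```

The first helper mirrors the split of $\|\theta\|_1$ into coordinates above and below the threshold $\tau$; the second mirrors the counting argument that bounds $|S|$.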



B Proof of Lemma 3 



The result is obtained by inverting known results on (dyadic) entropy numbers of $\ell_q$-balls; there are
some minor technical subtleties in performing the inversion. For a $d$-dimensional $\ell_q$-ball with $q \in
(0, p)$, it is known [30, 20, 17] that for all integers $k \in [\log d, d]$, the dyadic entropy numbers $\epsilon_k$ of the
ball $\mathbb{B}_q(1)$ with respect to the $\ell_p$-norm scale as

$$\epsilon_k(\ell_q, \|\cdot\|_p) \;=\; C_{q,p}\Big[\frac{\log(1 + \frac{d}{k})}{k}\Big]^{\frac{1}{q}-\frac{1}{p}}. \tag{46}$$

Moreover, for $k \in [1, \log d]$, we have $\epsilon_k(\ell_q) \le C_{q,p}$.

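We first establish the upper bound on the metric entropy. Since $d \ge 2$, we have

$$\epsilon_k(\ell_q) \;\le\; C_{q,p}\Big[\frac{\log(1 + \frac{d}{2})}{k}\Big]^{\frac{1}{q}-\frac{1}{p}} \;\le\; C_{q,p}\Big[\frac{\log d}{k}\Big]^{\frac{1}{q}-\frac{1}{p}}.$$

Inverting this inequality for $k = \log N_{p,q}(\epsilon)$ and allowing for a ball radius $R_q$ yields

$$\log N_{p,q}(\epsilon) \;\le\; \Big(C_{q,p}\,\frac{R_q^{1/q}}{\epsilon}\Big)^{\frac{pq}{p-q}}\log d, \tag{47}$$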



as claimed. 
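For readers who want the inversion spelled out, the following is a sketch of the algebra. It assumes, as is standard, that $k$ plays the role of the log covering number at accuracy $\epsilon$, and that passing from the unit ball $\mathbb{B}_q(1)$ to $\mathbb{B}_q(R_q)$ rescales $\epsilon$ by $R_q^{1/q}$; constants are not tracked beyond $C_{q,p}$.

```latex
\begin{align*}
\epsilon = C_{q,p}\Big[\frac{\log d}{k}\Big]^{\frac{1}{q}-\frac{1}{p}}
&\;\Longleftrightarrow\;
k = \Big(\frac{C_{q,p}}{\epsilon}\Big)^{\frac{pq}{p-q}}\log d,
\qquad\text{since } \Big(\tfrac{1}{q}-\tfrac{1}{p}\Big)^{-1} = \tfrac{pq}{p-q}, \\
\epsilon \mapsto \epsilon / R_q^{1/q}
&\;\Longrightarrow\;
k = \Big(\frac{C_{q,p}\,R_q^{1/q}}{\epsilon}\Big)^{\frac{pq}{p-q}}\log d
  = C_{q,p}^{\frac{pq}{p-q}}\; R_q^{\frac{p}{p-q}}\;\epsilon^{-\frac{pq}{p-q}}\,\log d .
\end{align*}
```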

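We now turn to proving the lower bound on the metric entropy, for which we require the existence
of some fixed $\nu \in (0, 1)$ such that $k \le d^{1-\nu}$. Under this assumption, we have $1 + \frac{d}{k} \ge \frac{d}{k} \ge d^{\nu}$, and
hence

$$C_{q,p}\Big[\frac{\log(1 + \frac{d}{k})}{k}\Big]^{\frac{1}{q}-\frac{1}{p}} \;\ge\; C_{q,p}\Big[\frac{\nu\log d}{k}\Big]^{\frac{1}{q}-\frac{1}{p}}.$$

Accounting for the radius $R_q$ as was done for the upper bound yields

$$\log N_{p,q}(\epsilon) \;\ge\; \nu\,\Big(C_{q,p}\,\frac{R_q^{1/q}}{\epsilon}\Big)^{\frac{pq}{p-q}}\log d,$$

as claimed.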

Finally, let us check that our assumptions on $k$ needed to perform the inversion are ensured by
the conditions that we have imposed on $\epsilon$. The condition $k \ge \log d$ is ensured by setting $\epsilon < 1$.
Turning to the condition $k \le d^{1-\nu}$, from the bound (47) on $k$, it suffices to choose $\epsilon$ such that
$\big(\frac{C_{q,p}}{\epsilon}\big)^{\frac{pq}{p-q}}\log d \le d^{1-\nu}$. This condition is ensured by enforcing the lower bound
$\epsilon^{\frac{pq}{p-q}} = \Omega\big(\frac{\log d}{d^{1-\nu}}\big)$ for some $\nu \in (0, 1)$.
C Proof of Lemma 4

We deal first with (dyadic) entropy numbers, as previously defined (27), and show that

$$\epsilon_{2k-1}\big(\mathrm{absconv}_q(X/\sqrt{n}),\ \|\cdot\|_2\big) \;\le\; c\,\kappa_c\,\min\Big\{1,\ \Big(\frac{\log(1 + \frac{d}{k})}{k}\Big)^{\frac{1}{q}-\frac{1}{2}}\Big\}. \tag{48}$$

We prove this intermediate claim by combining a number of known results on the behavior of dyadic
entropy numbers. First, using Corollary 9 from Guedon and Litvak [17], for all $k = 1, 2, \ldots$, we have

$$\epsilon_{2k-1}\big(\mathrm{absconv}_q(X/\sqrt{n}),\ \|\cdot\|_2\big) \;\le\; c\;\epsilon_k\big(\mathrm{absconv}_1(X/\sqrt{n}),\ \|\cdot\|_2\big)\,\min\Big\{1,\ \Big(\frac{\log(1 + \frac{d}{k})}{k}\Big)^{\frac{1}{q}-1}\Big\}.$$

Using Corollary 2.4 from Carl and Pajor [7], we obtain

$$\epsilon_k\big(\mathrm{absconv}_1(X/\sqrt{n}),\ \|\cdot\|_2\big) \;\le\; \frac{c}{\sqrt{n}}\,|||X|||_{1\to 2}\,\min\Big\{1,\ \Big(\frac{\log(1 + \frac{d}{k})}{k}\Big)^{1/2}\Big\},$$

where $|||X|||_{1\to 2}$ denotes the norm of $X$ viewed as an operator from $\ell_1^d \to \ell_2^n$. More specifically, we
have

$$\frac{1}{\sqrt{n}}|||X|||_{1\to 2} \;=\; \frac{1}{\sqrt{n}}\sup_{\|u\|_1 = 1}\|Xu\|_2 \;=\; \frac{1}{\sqrt{n}}\sup_{\|v\|_2 = 1}\ \sup_{\|u\|_1 = 1} v^T X u \;=\; \max_{i = 1, \ldots, d}\|X_i\|_2/\sqrt{n} \;\le\; \kappa_c.$$

Overall, we have shown that $\epsilon_{2k-1}\big(\mathrm{absconv}_q(X/\sqrt{n}), \|\cdot\|_2\big) \le c\,\kappa_c\min\big\{1, \big(\frac{\log(1 + d/k)}{k}\big)^{\frac{1}{q}-\frac{1}{2}}\big\}$,
as claimed. Finally, under the stated assumptions, we may invert the upper bound (48) by the same
procedure as in the proof of Lemma 3 (see Appendix B), thereby obtaining the claim.
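The operator-norm computation in this proof says that the $\ell_1 \to \ell_2$ norm of a matrix equals its largest column $\ell_2$-norm, because the convex function $u \mapsto \|Xu\|_2$ is maximized over the $\ell_1$-ball at a signed standard basis vector. A minimal numerical illustration follows; the random matrix, its dimensions, and the helper names are assumptions made for the example.

```python
import math
import random

def column_norms(X):
    """l2 norms of the columns of X, where X is a list of rows."""
    n, d = len(X), len(X[0])
    return [math.sqrt(sum(X[i][j] ** 2 for i in range(n))) for j in range(d)]

def l2_norm_of_Xu(X, u):
    """||X u||_2 for a matrix X (list of rows) and a vector u."""
    return math.sqrt(sum(sum(row[j] * u[j] for j in range(len(u))) ** 2 for row in X))

def random_l1_unit_vector(d, rng):
    """A random vector rescaled to have unit l1 norm."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(d)]
    total = sum(abs(x) for x in raw)
    return [x / total for x in raw]

rng = random.Random(0)
n, d = 6, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
norms = column_norms(X)
op_norm = max(norms)  # claimed value of the l1 -> l2 operator norm
best_random = max(l2_norm_of_Xu(X, random_l1_unit_vector(d, rng)) for _ in range(500))
print(best_random <= op_norm + 1e-12)
```

Random points of the $\ell_1$-sphere never exceed the largest column norm, and the basis vector indexing that column attains it.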

D Proof of Lemma 5 

In this appendix, we prove Lemma 5. Our proof is inspired by related results from the approximation
theory literature (see, e.g., Kuhn [20]). For each even integer $s = 2, 4, 6, \ldots, d$, let us define the set

$$\mathcal{H} := \big\{z \in \{-1, 0, +1\}^d \;\big|\; \|z\|_0 = s\big\}. \tag{49}$$

Note that the cardinality of this set is $|\mathcal{H}| = \binom{d}{s}2^s$, and moreover, we have $\|z - z'\|_0 \le 2s$ for all pairs
$z, z' \in \mathcal{H}$. We now define the Hamming distance $\rho_H$ on $\mathcal{H} \times \mathcal{H}$ via $\rho_H(z, z') = \sum_{j=1}^{d}\mathbb{I}[z_j \ne z'_j]$. For
some fixed element $z \in \mathcal{H}$, consider the set $\{z' \in \mathcal{H} \mid \rho_H(z, z') \le s/2\}$. Note that its cardinality is
upper bounded as

$$\big|\{z' \in \mathcal{H} \mid \rho_H(z, z') \le s/2\}\big| \;\le\; \binom{d}{s/2}\,3^{s/2}.$$

To see this, note that we simply choose a subset of size $s/2$ containing the coordinates where $z$ and $z'$
may differ, and then choose the entries of $z'$ on those $s/2$ coordinates arbitrarily from $\{-1, 0, +1\}$.
Now consider a set $A \subset \mathcal{H}$ with cardinality at most $|A| \le m := \binom{d}{s}\big/\binom{d}{s/2}$. The set of elements $z \in \mathcal{H}$
that are within Hamming distance $s/2$ of some element of $A$ has cardinality at most

$$\big|\{z \in \mathcal{H} \mid \rho_H(z, z') \le s/2 \text{ for some } z' \in A\}\big| \;\le\; |A|\,\binom{d}{s/2}\,3^{s/2} \;<\; |\mathcal{H}|,$$

where the final inequality holds since $m\binom{d}{s/2}3^{s/2} < |\mathcal{H}|$. Consequently, for any such set with
cardinality $|A| \le m$, there exists a $z \in \mathcal{H}$ such that $\rho_H(z, z') > s/2$ for all $z' \in A$. By inductively adding
this element at each round, we then create a set $A \subset \mathcal{H}$ with $|A| > m$ such that $\rho_H(z, z') > s/2$
for all $z, z' \in A$.

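To conclude, let us lower bound the cardinality $m$. We have

$$m \;=\; \frac{\binom{d}{s}}{\binom{d}{s/2}} \;=\; \frac{(d - s/2)!\,(s/2)!}{(d - s)!\,s!} \;=\; \prod_{j=1}^{s/2}\frac{d - s + j}{s/2 + j} \;\ge\; \Big(\frac{d - s/2}{s}\Big)^{s/2},$$

where the final inequality uses the fact that the ratio $\frac{d - s + j}{s/2 + j}$ is decreasing as a function of $j$ (for $d - s \ge s/2$), so that each factor is at least its value at $j = s/2$.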
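The counting steps of this proof can be checked exactly with integer arithmetic. The sketch below is illustrative only: the function names and the sample pairs $(d, s)$ are assumptions, and the lower bound it tests is the one obtained by raising the smallest ratio (the one at $j = s/2$) to the power $s/2$, valid when $d - s \ge s/2$.

```python
from fractions import Fraction
from math import comb

def packing_quantities(d, s):
    """Exact counts for even s <= d: |H|, the Hamming-ball bound, and m."""
    assert s % 2 == 0 and 2 <= s <= d
    size_H = comb(d, s) * 2 ** s                 # |H| = C(d, s) * 2^s
    ball = comb(d, s // 2) * 3 ** (s // 2)       # bound on a Hamming ball of radius s/2
    m = Fraction(comb(d, s), comb(d, s // 2))    # m = C(d, s) / C(d, s/2)
    return size_H, ball, m

def m_as_product(d, s):
    """m written as the product over j = 1..s/2 of (d - s + j) / (s/2 + j)."""
    prod = Fraction(1)
    for j in range(1, s // 2 + 1):
        prod *= Fraction(d - s + j, s // 2 + j)
    return prod

def smallest_ratio_bound(d, s):
    """((d - s/2) / s)^(s/2): the j = s/2 ratio raised to the power s/2."""
    return Fraction(2 * d - s, 2 * s) ** (s // 2)

for d, s in [(10, 2), (20, 4), (50, 10)]:
    size_H, ball, m = packing_quantities(d, s)
    print(d, s, m == m_as_product(d, s), m * ball < size_H, smallest_ratio_bound(d, s) <= m)
```

Using `Fraction` keeps every comparison exact, so the checks do not depend on floating-point rounding.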

E Proof of Proposition 1 

In this appendix, we prove both parts of Proposition 1. In addition to proving the lower bound (24),
we also prove the analogous upper bound

$$\frac{\|Xv\|_2}{\sqrt{n}} \;\le\; 3\,\|\Sigma^{1/2}v\|_2 + 6\Big[\frac{\rho(\Sigma)\log d}{n}\Big]^{1/2}\|v\|_1 \quad \text{for all } v \in \mathbb{R}^d. \tag{50}$$



Our approach to proving the bounds (24) and (50) is based on Slepian's lemma [23, 12] as well as
an extension thereof due to Gordon [15]. For the reader's convenience, we re-state versions of this
lemma here. Given some index set $U \times V$, let $\{Y_{u,v},\ (u, v) \in U \times V\}$ and $\{Z_{u,v},\ (u, v) \in U \times V\}$
be a pair of zero-mean Gaussian processes. Given the semi-norm on these processes defined via
$\sigma(X) = \mathbb{E}[X^2]^{1/2}$, Slepian's lemma asserts that if

$$\sigma(Y_{u,v} - Y_{u',v'}) \;\le\; \sigma(Z_{u,v} - Z_{u',v'}) \quad \text{for all } (u, v) \text{ and } (u', v') \text{ in } U \times V, \tag{51}$$

then

$$\mathbb{E}\Big[\sup_{(u,v) \in U \times V} Y_{u,v}\Big] \;\le\; \mathbb{E}\Big[\sup_{(u,v) \in U \times V} Z_{u,v}\Big]. \tag{52}$$

One version of Gordon's extension [15, 23] asserts that if the inequality (51) holds for $(u, v)$ and
$(u', v')$ in $U \times V$, and holds with equality when $v = v'$, then

$$\mathbb{E}\Big[\sup_{v \in V}\inf_{u \in U} Y_{u,v}\Big] \;\le\; \mathbb{E}\Big[\sup_{v \in V}\inf_{u \in U} Z_{u,v}\Big]. \tag{53}$$

Turning to the problem at hand, any random matrix $X$ from the given ensemble can be written as
$W\Sigma^{1/2}$, where $W \in \mathbb{R}^{n \times d}$ is a matrix with i.i.d. $N(0, 1)$ entries, and $\Sigma^{1/2}$ is the symmetric matrix
square root. We choose the set $U$ as the unit sphere $S^{n-1} = \{u \in \mathbb{R}^n \mid \|u\|_2 = 1\}$, and for some radius
$r$, we choose $V$ as the set

$$V(r) := \{v \in \mathbb{R}^d \mid \|\Sigma^{1/2}v\|_2 = 1,\ \|v\|_1 \le r\}.$$

(Although this set may be empty for certain choices of $r$, our analysis only concerns those choices for
which it is non-empty.) For a matrix $M$, we define the associated Frobenius norm $|||M|||_F = \big[\sum_{i,j} M_{ij}^2\big]^{1/2}$,
and for any $v \in V(r)$, we introduce the convenient shorthand $\widetilde{v} = \Sigma^{1/2}v$.

With these definitions, consider the centered Gaussian process $Y_{u,v} = u^T W\widetilde{v}$ indexed by $S^{n-1} \times
V(r)$. Given two pairs $(u, v)$ and $(u', v')$ in $S^{n-1} \times V(r)$, we have

$$\begin{aligned}
\sigma^2(Y_{u,v} - Y_{u',v'}) &= |||u\widetilde{v}^T - u'(\widetilde{v}')^T|||_F^2 \\
&= |||u\widetilde{v}^T - u'\widetilde{v}^T + u'\widetilde{v}^T - u'(\widetilde{v}')^T|||_F^2 \\
&= \|\widetilde{v}\|_2^2\,\|u - u'\|_2^2 + \|u'\|_2^2\,\|\widetilde{v} - \widetilde{v}'\|_2^2 + 2\big(u^T u' - \|u'\|_2^2\big)\big(\|\widetilde{v}\|_2^2 - \widetilde{v}^T\widetilde{v}'\big). \qquad (54)
\end{aligned}$$

Now by the Cauchy-Schwarz inequality and the equalities $\|u\|_2 = \|u'\|_2 = 1$ and $\|\widetilde{v}\|_2 = \|\widetilde{v}'\|_2 = 1$, we
have $u^T u' - \|u'\|_2^2 \le 0$, and $\|\widetilde{v}\|_2^2 - \widetilde{v}^T\widetilde{v}' \ge 0$. Consequently, we may conclude that

$$\sigma^2(Y_{u,v} - Y_{u',v'}) \;\le\; \|u - u'\|_2^2 + \|\widetilde{v} - \widetilde{v}'\|_2^2. \tag{55}$$

We claim that the Gaussian process $Y_{u,v}$ satisfies the conditions of Gordon's lemma in terms of the
zero-mean Gaussian process $Z_{u,v}$ given by

$$Z_{u,v} = g^T u + h^T(\Sigma^{1/2}v), \tag{56}$$

where $g \in \mathbb{R}^n$ and $h \in \mathbb{R}^d$ are both standard Gaussian vectors (i.e., with i.i.d. $N(0, 1)$ entries). To
establish this claim, we compute

$$\sigma^2(Z_{u,v} - Z_{u',v'}) \;=\; \|u - u'\|_2^2 + \|\Sigma^{1/2}(v - v')\|_2^2 \;=\; \|u - u'\|_2^2 + \|\widetilde{v} - \widetilde{v}'\|_2^2.$$

Thus, from equation (55), we see that Slepian's condition (51) holds. On the other hand, when $v = v'$,
we see from equation (54) that

$$\sigma^2(Y_{u,v} - Y_{u',v}) \;=\; \|u - u'\|_2^2 \;=\; \sigma^2(Z_{u,v} - Z_{u',v}),$$

so that the equality required for Gordon's inequality is also satisfied.
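The variance comparison between the two processes is pure linear algebra, so it can be checked numerically. In the sketch below the vectors `vt` and `vt2` stand for $\Sigma^{1/2}v$ and $\Sigma^{1/2}v'$ (taken to be unit vectors, as on $V(r)$); all function names and dimensions are illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def var_Y_diff(u, vt, u2, vt2):
    """Variance of Y_{u,v} - Y_{u',v'}: squared Frobenius norm of u vt^T - u' vt'^T."""
    return sum((u[i] * vt[j] - u2[i] * vt2[j]) ** 2
               for i in range(len(u)) for j in range(len(vt)))

def expansion(u, vt, u2, vt2):
    """Three-term expansion of the same variance, with the signed cross term."""
    du, dv = sub(u, u2), sub(vt, vt2)
    cross = 2 * (dot(u, u2) - dot(u2, u2)) * (dot(vt, vt) - dot(vt, vt2))
    return dot(vt, vt) * dot(du, du) + dot(u2, u2) * dot(dv, dv) + cross

def var_Z_diff(u, vt, u2, vt2):
    """Variance of Z_{u,v} - Z_{u',v'} for Z_{u,v} = g^T u + h^T vt."""
    du, dv = sub(u, u2), sub(vt, vt2)
    return dot(du, du) + dot(dv, dv)

rng = random.Random(1)
def rand_unit(k):
    return unit([rng.gauss(0.0, 1.0) for _ in range(k)])

u, u2 = rand_unit(4), rand_unit(4)
vt, vt2 = rand_unit(7), rand_unit(7)
print(var_Y_diff(u, vt, u2, vt2) <= var_Z_diff(u, vt, u2, vt2) + 1e-12)
```

With unit vectors the cross term is a product of a non-positive and a non-negative factor, which is why the $Y$-variance never exceeds the $Z$-variance, and the two agree when `vt` equals `vt2`.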

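Establishing an upper bound: We begin by exploiting Slepian's inequality (52) to establish the
upper bound (50). We have

$$\begin{aligned}
\mathbb{E}\Big[\sup_{v \in V(r)}\|Xv\|_2\Big] &= \mathbb{E}\Big[\sup_{(u,v) \in S^{n-1} \times V(r)} u^T X v\Big] \\
&\le \mathbb{E}\Big[\sup_{(u,v) \in S^{n-1} \times V(r)} Z_{u,v}\Big] \\
&= \mathbb{E}\Big[\sup_{\|u\|_2 = 1} g^T u\Big] + \mathbb{E}\Big[\sup_{v \in V(r)} h^T(\Sigma^{1/2}v)\Big] \\
&\le \mathbb{E}[\|g\|_2] + \mathbb{E}\Big[\sup_{v \in V(r)} h^T(\Sigma^{1/2}v)\Big].
\end{aligned}$$

By Jensen's inequality, we have $\mathbb{E}[\|g\|_2] \le \sqrt{\mathbb{E}[\|g\|_2^2]} = \sqrt{n}$, from which we can conclude that

$$\mathbb{E}\Big[\sup_{v \in V(r)}\|Xv\|_2\Big] \;\le\; \sqrt{n} + \mathbb{E}\Big[\sup_{v \in V(r)} h^T(\Sigma^{1/2}v)\Big]. \tag{57}$$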
Turning to the remaining term, we have

$$\sup_{v \in V(r)}\big|h^T(\Sigma^{1/2}v)\big| \;\le\; \sup_{v \in V(r)}\|v\|_1\,\|\Sigma^{1/2}h\|_\infty \;\le\; r\,\|\Sigma^{1/2}h\|_\infty.$$

Since each element $(\Sigma^{1/2}h)_i$ is zero-mean Gaussian with variance at most $\rho(\Sigma) = \max_i \Sigma_{ii}$, standard
results on Gaussian maxima (e.g., [23]) imply that $\mathbb{E}[\|\Sigma^{1/2}h\|_\infty] \le \sqrt{3\rho(\Sigma)\log d}$. Putting together
the pieces, we conclude that for $q = 1$

$$\mathbb{E}\Big[\sup_{v \in V(r)}\|Xv\|_2/\sqrt{n}\Big] \;\le\; \underbrace{1 + \Big[3\rho(\Sigma)\frac{\log d}{n}\Big]^{1/2} r}_{t_u(r)}. \tag{58}$$

Having controlled the expectation, it remains to establish sharp concentration. Let $f : \mathbb{R}^D \to \mathbb{R}$ be a
Lipschitz function with constant $L$ with respect to the $\ell_2$-norm. Then if $w \sim N(0, I_{D \times D})$ is standard
normal, we are guaranteed [22] that for all $t > 0$,

$$\mathbb{P}\big[|f(w) - \mathbb{E}[f(w)]| \ge t\big] \;\le\; 2\exp\Big(-\frac{t^2}{2L^2}\Big). \tag{59}$$

Note the dimension-independent nature of this inequality. We apply this result to the random matrix
$W \in \mathbb{R}^{n \times d}$, viewed as a standard normal random vector in $D = nd$ dimensions. First, letting
$f(W) = \sup_{v \in V(r)}\|W\Sigma^{1/2}v\|_2/\sqrt{n}$, we find that

$$\begin{aligned}
\sqrt{n}\,[f(W) - f(W')] &= \sup_{v \in V(r)}\|W\Sigma^{1/2}v\|_2 - \sup_{v \in V(r)}\|W'\Sigma^{1/2}v\|_2 \\
&\le \sup_{v \in V(r)}\|\Sigma^{1/2}v\|_2\;|||W - W'|||_F \\
&= |||W - W'|||_F
\end{aligned}$$

since $\|\Sigma^{1/2}v\|_2 = 1$ for all $v \in V(r)$. We have thus shown that the Lipschitz constant $L \le 1/\sqrt{n}$.
Recalling the definition of $t_u(r)$ from the upper bound (58), we set $t = t_u(r)/2$ in the tail bound (59),
thereby obtaining

$$\mathbb{P}\Big[\sup_{v \in V(r)}\frac{\|Xv\|_2}{\sqrt{n}} \ge \frac{3}{2}\,t_u(r)\Big] \;\le\; 2\exp\Big(-n\,\frac{t_u^2(r)}{8}\Big). \tag{60}$$

We now exploit this family of tail bounds to upper bound the probability of the event

$$\mathcal{T} := \big\{\exists\, v \in \mathbb{R}^d \text{ s.t. } \|\Sigma^{1/2}v\|_2 = 1 \text{ and } \|Xv\|_2/\sqrt{n} \ge 3\,t_u(\|v\|_1)\big\}.$$

We do so using Lemma 9 from Appendix H. In particular, for the case $\mathcal{E} = \mathcal{T}$, we may apply this
lemma with the objective functions $f(v; X) = \|Xv\|_2/\sqrt{n}$, sequence $a_n = n$, the constraint $\rho(\cdot) = \|\cdot\|_1$,
the set $S = \{v \in \mathbb{R}^d \mid \|\Sigma^{1/2}v\|_2 = 1\}$, and $g(r) = 3t_u(r)/2$. Note that the bound (60) means
that the tail bound (65) holds with $c = 4/72$. Therefore, by applying Lemma 9, we conclude that
$\mathbb{P}[\mathcal{T}] \le c_1\exp(-c_2 n)$ for some numerical constants $c_i$.

Finally, in order to extend the inequality to arbitrary $v \in \mathbb{R}^d$, we note that the rescaled vector
$\widehat{v} = v/\|\Sigma^{1/2}v\|_2$ satisfies $\|\Sigma^{1/2}\widehat{v}\|_2 = 1$. Consequently, conditional on the event $\mathcal{T}^c$, we have

$$\|X\widehat{v}\|_2/\sqrt{n} \;\le\; 3 + 3\sqrt{\frac{3\rho(\Sigma)\log d}{n}}\;\|\widehat{v}\|_1,$$

or equivalently, after multiplying through by $\|\Sigma^{1/2}v\|_2$, the inequality

$$\|Xv\|_2/\sqrt{n} \;\le\; 3\,\|\Sigma^{1/2}v\|_2 + 3\sqrt{\frac{3\rho(\Sigma)\log d}{n}}\;\|v\|_1,$$

thereby establishing the claim (50).






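Establishing the lower bound (24): We now exploit Gordon's inequality in order to establish the
lower bound (24). We have

$$-\inf_{v \in V(r)}\|Xv\|_2 \;=\; \sup_{v \in V(r)} -\|Xv\|_2 \;=\; \sup_{v \in V(r)}\ \inf_{u \in S^{n-1}} u^T X v.$$

Applying Gordon's inequality, we obtain

$$\begin{aligned}
\mathbb{E}\Big[\sup_{v \in V(r)} -\|Xv\|_2\Big] &\le \mathbb{E}\Big[\sup_{v \in V(r)}\ \inf_{u \in S^{n-1}} Z_{u,v}\Big] \\
&= \mathbb{E}\Big[\inf_{u \in S^{n-1}} g^T u\Big] + \mathbb{E}\Big[\sup_{v \in V(r)} h^T\Sigma^{1/2}v\Big] \\
&\le -\mathbb{E}[\|g\|_2] + \big[3\rho(\Sigma)\log d\big]^{1/2}\, r,
\end{aligned}$$

where we have used our previous derivation to upper bound $\mathbb{E}[\sup_{v \in V(r)} h^T\Sigma^{1/2}v]$. Noting$^4$ that
$\mathbb{E}[\|g\|_2] \ge \sqrt{n}/2$ for all $n \ge 1$, we divide by $\sqrt{n}$ and add 1 to both sides so as to obtain

$$\mathbb{E}\Big[\sup_{v \in V(r)}\big(1 - \|Xv\|_2/\sqrt{n}\big)\Big] \;\le\; \underbrace{\frac{1}{2} + \Big[3\rho(\Sigma)\frac{\log d}{n}\Big]^{1/2} r}_{t_\ell(r)}. \tag{61}$$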

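Next define the function $f(W) = \sup_{v \in V(r)}\big(1 - \|W\Sigma^{1/2}v\|_2/\sqrt{n}\big)$. The same argument as
before shows that its Lipschitz constant is at most $1/\sqrt{n}$. Setting $t = t_\ell(r)/2$ in the concentration
statement (59) and combining with the bound (61), we conclude that

$$\mathbb{P}\Big[\sup_{v \in V(r)}\big(1 - \|Xv\|_2/\sqrt{n}\big) \ge \frac{3}{2}\,t_\ell(r)\Big] \;\le\; 2\exp\Big(-n\,\frac{t_\ell^2(r)}{8}\Big). \tag{62}$$

Define the event

$$\widetilde{\mathcal{T}} := \big\{\exists\, v \in \mathbb{R}^d \text{ s.t. } \|\Sigma^{1/2}v\|_2 = 1 \text{ and } \big(1 - \|Xv\|_2/\sqrt{n}\big) \ge 3\,t_\ell(\|v\|_1)\big\}.$$

We can now apply Lemma 9 with $a_n = n$, $g(r) = 3t_\ell(r)/2$ and $\mu = 1/2$ to conclude that there exist
constants $c_i$ such that $\mathbb{P}[\widetilde{\mathcal{T}}] \le c_1\exp(-c_2 n)$.

Finally, to extend the claim to all vectors $v$, we consider the rescaled vector $\widehat{v} = v/\|\Sigma^{1/2}v\|_2$.
Conditioned on the event $\widetilde{\mathcal{T}}^c$, we have for all $v \in \mathbb{R}^d$,

$$1 - \|X\widehat{v}\|_2/\sqrt{n} \;\le\; \frac{1}{2} + 3\sqrt{\frac{3\rho(\Sigma)\log d}{n}}\;\|\widehat{v}\|_1,$$

or equivalently, after multiplying through by $\|\Sigma^{1/2}v\|_2$ and re-arranging,

$$\|Xv\|_2/\sqrt{n} \;\ge\; \frac{1}{2}\,\|\Sigma^{1/2}v\|_2 - 3\sqrt{\frac{3\rho(\Sigma)\log d}{n}}\;\|v\|_1,$$

as claimed.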



$^4$ In fact, $|\mathbb{E}[\|g\|_2] - \sqrt{n}| = o(\sqrt{n})$, but this simple bound is sufficient for our purposes.
F Proof of Lemma 6

For a given radius $r > 0$, define the set

$$\mathbb{S}(s, r) := \{\theta \in \mathbb{R}^d \mid \|\theta\|_0 \le 2s,\ \|\theta\|_2 \le r\},$$

and the random variables $Z_n = Z_n(s, r)$ given by

$$Z_n := \sup_{\theta \in \mathbb{S}(s,r)}\frac{1}{n}\big|w^T X\theta\big|.$$

For a given $\epsilon \in (0, 1)$ to be chosen, let us upper bound the minimal cardinality of a set that covers
$\mathbb{S}(s, r)$ up to $(r\epsilon)$-accuracy in $\ell_2$-norm. We claim that we may find such a covering set $\{\theta^1, \ldots, \theta^N\} \subset
\mathbb{S}(s, r)$ with cardinality $N = N(s, r, \epsilon)$ that is upper bounded as

$$\log N(s, r, \epsilon) \;\le\; \log\binom{d}{2s} + 2s\log(1/\epsilon).$$

To establish this claim, note that there are $\binom{d}{2s}$ subsets of size $2s$ within $\{1, 2, \ldots, d\}$. Moreover, for
any $2s$-sized subset, there is an $(r\epsilon)$-covering in $\ell_2$-norm of the ball $\mathbb{B}_2(r)$ with at most $2^{2s\log(1/\epsilon)}$
elements (e.g., [24]).

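Consequently, for each $\theta \in \mathbb{S}(s, r)$, we may find some $\theta^k$ such that $\|\theta - \theta^k\|_2 \le r\epsilon$. By triangle
inequality, we then have

$$\frac{1}{n}\big|w^T X\theta\big| \;\le\; \frac{1}{n}\big|w^T X\theta^k\big| + \frac{1}{n}\big|w^T X(\theta - \theta^k)\big|.$$

Given the assumptions on $X$, we have $\|X(\theta - \theta^k)\|_2/\sqrt{n} \le \kappa_u\|\theta - \theta^k\|_2 \le \kappa_u r\epsilon$. Moreover,
since the variate $\|w\|_2^2/\sigma^2$ is $\chi^2$ with $n$ degrees of freedom, we have $\|w\|_2/\sqrt{n} \le 2\sigma$ with probability
$1 - c_1\exp(-c_2 n)$, using standard tail bounds (see Appendix I). Putting together the pieces (and applying
the Cauchy-Schwarz inequality to the second term), we conclude that

$$\frac{1}{n}\big|w^T X\theta\big| \;\le\; \frac{1}{n}\big|w^T X\theta^k\big| + 2\kappa_u\sigma r\epsilon$$

with high probability. Taking the supremum over $\theta$ on both sides yields

$$Z_n \;\le\; \max_{k = 1, 2, \ldots, N}\frac{1}{n}\big|w^T X\theta^k\big| + 2\kappa_u\sigma r\epsilon.$$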
fc=i,2,...,JV n 

It remains to bound the finite maximum over the covering set. We begin by observing that each
variate $w^T X\theta^k/n$ is zero-mean Gaussian with variance $\sigma^2\|X\theta^k\|_2^2/n^2$. Under the given conditions on
$\theta^k$ and $X$, this variance is at most $\sigma^2\kappa_u^2 r^2/n$, so that by standard Gaussian tail bounds, we conclude
that

$$Z_n \;\le\; \sigma\, r\,\kappa_u\sqrt{\frac{3\log N(s, r, \epsilon)}{n}} + 2\kappa_u\sigma r\epsilon \;=\; \sigma\, r\,\kappa_u\Big\{\sqrt{\frac{3\log N(s, r, \epsilon)}{n}} + 2\epsilon\Big\} \tag{63}$$

with probability greater than $1 - c_1\exp(-c_2\log N(s, r, \epsilon))$.

Finally, suppose that $\epsilon = \sqrt{\frac{s\log(d/2s)}{n}}$. With this choice and recalling that $n \le d$ by assumption,
we obtain

$$\begin{aligned}
\frac{\log N(s, r, \epsilon)}{n} &\le \frac{\log\binom{d}{2s}}{n} + \frac{s\log\big(\frac{n}{s\log(d/2s)}\big)}{n} \\
&\le \frac{\log\binom{d}{2s}}{n} + \frac{s\log(d/s)}{n} \\
&\le \frac{2s + 2s\log(d/s)}{n} + \frac{s\log(d/s)}{n},
\end{aligned}$$

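where the final line uses standard bounds on binomial coefficients. Since $d/s \ge 2$ by assumption, we
conclude that our choice of $\epsilon$ guarantees that $\frac{\log N(s, r, \epsilon)}{n} \le \frac{5\, s\log(d/s)}{n}$. Substituting these relations
into the inequality (63), we conclude that

$$Z_n \;\le\; \sigma\, r\,\kappa_u\Big\{\sqrt{\frac{15\, s\log(d/s)}{n}} + 2\sqrt{\frac{s\log(d/s)}{n}}\Big\} \;\le\; 6\,\sigma\, r\,\kappa_u\sqrt{\frac{s\log(d/s)}{n}},$$

as claimed. Since $\log N(s, r, \epsilon) \ge s\log(d - 2s)$, this event occurs with probability at least
$1 - c_1\exp(-c_2\min\{n, s\log(d - s)\})$, as claimed.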
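To see how the pieces of this covering argument combine, the sketch below evaluates the entropy budget $\log\binom{d}{2s} + 2s\log(1/\epsilon)$ for the stated choice of $\epsilon$ and reports the resulting numerical factor in front of $\sigma r\kappa_u\sqrt{s\log(d/s)/n}$. The function names and the sample triple $(n, d, s)$ are illustrative assumptions; the check is numerical, not a proof.

```python
from math import comb, log, sqrt

def covering_budget(n, d, s):
    """Return (eps, log N upper bound, 5*s*log(d/s)) for eps = sqrt(s*log(d/(2s))/n)."""
    eps = sqrt(s * log(d / (2 * s)) / n)
    log_N = log(comb(d, 2 * s)) + 2 * s * log(1 / eps)
    return eps, log_N, 5 * s * log(d / s)

def zn_constant(n, d, s):
    """Factor multiplying sigma*r*kappa_u*sqrt(s*log(d/s)/n) implied by the covering bound."""
    eps, log_N, _ = covering_budget(n, d, s)
    scale = sqrt(s * log(d / s) / n)
    return (sqrt(3 * log_N / n) + 2 * eps) / scale

# Hypothetical problem sizes with s << n <= d.
eps, log_N, budget = covering_budget(n=200, d=1000, s=5)
print(eps < 1, log_N <= budget, zn_constant(200, 1000, 5) <= 6)
```

For these sizes the implied factor stays below the constant 6 appearing in Lemma 6, in line with the bound $\sqrt{15} + 2 \le 6$.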

G Proofs for Theorem 4 

This appendix is devoted to the proofs of technical lemmas used in Theorem 4. 

G.1 Proof of Lemma 7

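For $q \in (0, 1)$, let us define the set

$$\mathbb{S}_q(R_q, r) := \mathbb{B}_q(2R_q) \cap \big\{\theta \in \mathbb{R}^d \;\big|\; \|\widetilde{X}\theta\|_2/\sqrt{n} \le r\big\}.$$

We seek to bound the random variable $Z(R_q, r) := \sup_{\theta \in \mathbb{S}_q(R_q, r)}\frac{1}{n}|\widetilde{w}^T\widetilde{X}\theta|$, which we do by a chaining
result, in particular Lemma 3.2 in van de Geer [32]. Adopting the notation from this lemma, we
seek to apply it with $\epsilon = \delta/2$, and $K = 4$. Suppose that $\frac{\|\widetilde{X}\theta\|_2}{\sqrt{n}} \le r$, and

$$\sqrt{n}\,\delta \;\ge\; c_1 r \tag{64a}$$
$$\sqrt{n}\,\delta \;\ge\; c_1\int_{\delta/16}^{r}\sqrt{\log N(t; \mathbb{S}_q)}\;dt \;=:\; J(r, \delta), \tag{64b}$$

where $N(t; \mathbb{S}_q)$ is the covering number for $\mathbb{S}_q$ in the $\ell_2$-prediction norm (defined by $\|\widetilde{X}\theta\|_2/\sqrt{n}$). As
long as $\frac{\|\widetilde{w}\|_2^2}{n} \le 16$, Lemma 3.2 guarantees that

$$\mathbb{P}\Big[Z(R_q, r) \ge \delta,\ \frac{\|\widetilde{w}\|_2^2}{n} \le 16\Big] \;\le\; c_1\exp\Big(-c_2\frac{n\delta^2}{r^2}\Big).$$

By tail bounds on $\chi^2$ random variables (see Appendix I), we have $\mathbb{P}[\|\widetilde{w}\|_2^2 \ge 16n] \le c_4\exp(-c_5 n)$.
Consequently, we conclude that

$$\mathbb{P}[Z(R_q, r) \ge \delta] \;\le\; c_1\exp\Big(-c_2\frac{n\delta^2}{r^2}\Big) + c_4\exp(-c_5 n).$$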
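For some $c_3 > 0$, let us set

$$\delta \;=\; c_3\, r\,\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,\Big(\frac{\log d}{n}\Big)^{\frac{1}{2}-\frac{q}{4}},$$

and let us verify that the conditions (64a) and (64b) hold. Given our choice of $\delta$, we find that

$$\frac{\delta}{r}\sqrt{n} \;=\; c_3\,\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,(\log d)^{\frac{1}{2}-\frac{q}{4}}\, n^{\frac{q}{4}},$$

and since $d, n \to \infty$, we see that condition (64a) holds. Turning to verification of the inequality (64b),
we first provide an upper bound for $\log N(\mathbb{S}_q, t)$. Setting $\gamma = \frac{\widetilde{X}\theta}{\sqrt{n}}$ and from the definition (31) of
$\mathrm{absconv}_q(\widetilde{X}/\sqrt{n})$, we have

$$\sup_{\theta \in \mathbb{S}_q(R_q, r)}\frac{1}{n}\big|\widetilde{w}^T\widetilde{X}\theta\big| \;\le\; \sup_{\gamma \in \mathrm{absconv}_q(\widetilde{X}/\sqrt{n}),\ \|\gamma\|_2 \le r}\frac{1}{\sqrt{n}}\big|\widetilde{w}^T\gamma\big|.$$

We may apply the bound in Lemma 4 to conclude that $\log N(\epsilon; \mathbb{S}_q)$ is upper bounded by $c'\,R_q^{\frac{2}{2-q}}\big(\frac{\widetilde{\kappa}_c}{\epsilon}\big)^{\frac{2q}{2-q}}\log d$.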
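Using this upper bound, we have

$$\begin{aligned}
J(r, \delta) := \int_{\delta/16}^{r}\sqrt{\log N(\mathbb{S}_q, t)}\;dt &\le \int_{0}^{r}\sqrt{\log N(\mathbb{S}_q, t)}\;dt \\
&\le c\,R_q^{\frac{1}{2-q}}\,\widetilde{\kappa}_c^{\frac{q}{2-q}}\sqrt{\log d}\int_{0}^{r} t^{-\frac{q}{2-q}}\;dt \\
&= c'\,R_q^{\frac{1}{2-q}}\,\widetilde{\kappa}_c^{\frac{q}{2-q}}\sqrt{\log d}\;\; r^{1-\frac{q}{2-q}}.
\end{aligned}$$

Using this upper bound, let us verify that the inequality (64b) holds as long as $r = \Omega\big(\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,(\frac{\log d}{n})^{\frac{1}{2}-\frac{q}{4}}\big)$,
as assumed in the statement of Lemma 7. With our choice of $\delta$, we have

$$\frac{J}{\sqrt{n}\,\delta} \;\le\; \frac{c'\,R_q^{\frac{1}{2-q}}\,\widetilde{\kappa}_c^{\frac{q}{2-q}}\sqrt{\frac{\log d}{n}}\;\, r^{1-\frac{q}{2-q}}}{c_3\, r\,\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,\big(\frac{\log d}{n}\big)^{\frac{1}{2}-\frac{q}{4}}} \;=\; \frac{c'}{c_3}\,R_q^{\frac{1}{2-q}-\frac{1}{2}}\;\widetilde{\kappa}_c^{\frac{q}{2-q}-\frac{q}{2}}\,\Big(\frac{\log d}{n}\Big)^{\frac{q}{4}}\, r^{-\frac{q}{2-q}} \;\le\; \frac{c'}{c_3},$$

where the last step uses the assumed lower bound on $r$ (taking $c_3 \ge 1$), under which the powers of $R_q$, $\widetilde{\kappa}_c$ and $\frac{\log d}{n}$ cancel exactly,
so that condition (64b) will hold as long as we choose $c_3 > 0$ large enough. Overall, we conclude that
$\mathbb{P}\big[Z(R_q, r) \ge c_3\, r\,\widetilde{\kappa}_c^{q/2}\sqrt{R_q}\,(\frac{\log d}{n})^{\frac{1}{2}-\frac{q}{4}}\big] \le c_1\exp\big(-R_q(\log d)^{1-\frac{q}{2}}\, n^{\frac{q}{2}}\big)$, which concludes the proof.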



G.2 Proof of Lemma 8 

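First, consider a fixed subset $S \subset \{1, 2, \ldots, d\}$ of cardinality $|S| = s$. Applying the SVD to the
sub-matrix $X_S \in \mathbb{R}^{n \times s}$, we have $X_S = VDU$, where $V \in \mathbb{R}^{n \times s}$ has orthonormal columns, and
$DU \in \mathbb{R}^{s \times s}$. By construction, for any $\Delta_S \in \mathbb{R}^s$, we have $\|X_S\Delta_S\|_2 = \|DU\Delta_S\|_2$. Since $V$ has
orthonormal columns, the vector $\widetilde{w}_S = V^T w \in \mathbb{R}^s$ has i.i.d. $N(0, \sigma^2)$ entries. Consequently, for any
$\Delta_S$ such that $\frac{\|X_S\Delta_S\|_2}{\sqrt{n}} \le r$, we have

$$\Big|\frac{w^T X_S\Delta_S}{n}\Big| \;=\; \Big|\frac{\widetilde{w}_S^T}{\sqrt{n}}\,\frac{DU\Delta_S}{\sqrt{n}}\Big| \;\le\; \frac{\|\widetilde{w}_S\|_2}{\sqrt{n}}\,\frac{\|DU\Delta_S\|_2}{\sqrt{n}} \;\le\; \frac{\|\widetilde{w}_S\|_2}{\sqrt{n}}\; r.$$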


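Now the variate $\sigma^{-2}\|\widetilde{w}_S\|_2^2$ is $\chi^2$ with $s$ degrees of freedom, so that by standard $\chi^2$ tail bounds (see
Appendix I), we have

$$\mathbb{P}\Big[\frac{\|\widetilde{w}_S\|_2^2}{\sigma^2 s} \ge 1 + 4\delta\Big] \;\le\; \exp(-s\delta), \quad \text{valid for all } \delta \ge 1.$$

Setting $\delta = 20\log(\frac{d}{2s})$ and noting that $\log(\frac{d}{2s}) \ge \log 2$ by assumption, we have (after some algebra)

$$\mathbb{P}\Big[\frac{1}{n}\|\widetilde{w}_S\|_2^2 \ge \frac{\sigma^2 s}{n}\big(81\log(d/s)\big)\Big] \;\le\; \exp\Big(-20\, s\log\Big(\frac{d}{2s}\Big)\Big).$$

We have thus shown that for each fixed subset, we have the bound

$$\Big|\frac{w^T X_S\Delta_S}{n}\Big| \;\le\; r\,\sqrt{\frac{81\,\sigma^2 s\log(\frac{d}{s})}{n}}$$

with probability at least $1 - \exp(-20\, s\log(\frac{d}{2s}))$.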

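Since there are $\binom{d}{2s} \le \big(\frac{de}{2s}\big)^{2s}$ subsets of size $s$, applying a union bound yields that

$$\mathbb{P}\Big[\sup_{\theta \in \mathbb{B}_0(2s),\ \frac{\|X\theta\|_2}{\sqrt{n}} \le r}\Big|\frac{w^T X\theta}{n}\Big| \ge r\,\sqrt{\frac{81\,\sigma^2 s\log(\frac{d}{s})}{n}}\Big] \;\le\; \exp\Big(-20\, s\log\Big(\frac{d}{2s}\Big) + 2s\log\frac{de}{2s}\Big) \;\le\; \exp\Big(-10\, s\log\Big(\frac{d}{2s}\Big)\Big),$$

as claimed.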



H Large deviations for random objectives 

In this appendix, we state a result on large deviations of the constrained optimum of random objective
functions of the form $f(v; X)$, where $v \in \mathbb{R}^d$ is the optimization vector, and $X$ is some random vector.
Of interest is the optimization problem $\sup_{\rho(v) \le r,\ v \in S} f(v; X_n)$, where $\rho : \mathbb{R}^d \to \mathbb{R}_+$ is some
non-negative and increasing constraint function, and $S$ is a non-empty set. With this set-up, our goal is to
bound the probability of the event defined by

$$\mathcal{E} := \big\{\exists\, v \in S \text{ such that } f(v; X) \ge 2g(\rho(v))\big\},$$

where $g : \mathbb{R} \to \mathbb{R}$ is non-negative and strictly increasing.
Lemma 9. Suppose that $g(r) \ge \mu$ for all $r \ge 0$, and that there exists some constant $c > 0$ such that
for all $r > 0$, we have the tail bound

$$\mathbb{P}\Big[\sup_{v \in S,\ \rho(v) \le r} f_n(v; X_n) \ge g(r)\Big] \;\le\; 2\exp\big(-c\, a_n\, g^2(r)\big), \tag{65}$$

for some $a_n > 0$. Then we have

$$\mathbb{P}[\mathcal{E}_n] \;\le\; \frac{2\exp(-c\, a_n\mu^2)}{1 - \exp(-c\, a_n\mu^2)}. \tag{66}$$

Proof. Our proof is based on a standard peeling technique (e.g., see van de Geer [32] pp. 82). By
assumption, as $v$ varies over $S$, we have $g(\rho(v)) \in [\mu, \infty)$. Accordingly, for $m = 1, 2, \ldots$, defining the
sets

$$S_m := \big\{v \in S \;\big|\; 2^{m-1}\mu \le g(\rho(v)) \le 2^m\mu\big\},$$

we may conclude that if there exists $v \in S$ such that $f(v, X) \ge 2g(\rho(v))$, then this must occur for
some $m$ and $v \in S_m$. By union bound, we have

$$\mathbb{P}[\mathcal{E}] \;\le\; \sum_{m=1}^{\infty}\mathbb{P}\big[\exists\, v \in S_m \text{ such that } f(v, X) \ge 2g(\rho(v))\big].$$

If $v \in S_m$ and $f(v, X) \ge 2g(\rho(v))$, then by definition of $S_m$, we have $f(v, X) \ge 2\,(2^{m-1})\,\mu = 2^m\mu$.
Since for any $v \in S_m$, we have $g(\rho(v)) \le 2^m\mu$, we combine these inequalities to obtain

$$\begin{aligned}
\mathbb{P}[\mathcal{E}] &\le \sum_{m=1}^{\infty}\mathbb{P}\Big[\sup_{\rho(v) \le g^{-1}(2^m\mu)} f(v, X) \ge 2^m\mu\Big] \\
&\le \sum_{m=1}^{\infty} 2\exp\big(-c\, a_n\,[g(g^{-1}(2^m\mu))]^2\big) \\
&= 2\sum_{m=1}^{\infty}\exp\big(-c\, a_n\, 2^{2m}\mu^2\big),
\end{aligned}$$

from which the stated claim follows by upper bounding this geometric sum. □
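The last step bounds the peeled series by a geometric one: since $2^{2m} \ge m$, each term satisfies $\exp(-c\,a_n 2^{2m}\mu^2) \le x^m$ with $x = \exp(-c\,a_n\mu^2)$, and $\sum_{m \ge 1} x^m = x/(1-x)$. The sketch below compares the two sides numerically; the function names and parameter values are assumptions for illustration.

```python
from math import exp

def peeling_sum(c, a_n, mu, terms=60):
    """Partial sum of 2 * exp(-c * a_n * 2^(2m) * mu^2) over m = 1..terms."""
    return sum(2 * exp(-c * a_n * 4 ** m * mu ** 2) for m in range(1, terms + 1))

def geometric_bound(c, a_n, mu):
    """2x / (1 - x) with x = exp(-c * a_n * mu^2), the closed form of the geometric series."""
    x = exp(-c * a_n * mu ** 2)
    return 2 * x / (1 - x)

# Hypothetical values of c, a_n and mu; later terms underflow harmlessly to 0.0.
for c, a_n, mu in [(0.01, 1.0, 1.0), (0.05, 100.0, 0.5), (1.0, 10.0, 1.0)]:
    print(peeling_sum(c, a_n, mu) <= geometric_bound(c, a_n, mu))
```

The terms decay doubly exponentially in $m$, so truncating the series after a few dozen terms loses nothing at floating-point precision.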



I Some tail bounds for $\chi^2$-variates

The following large-deviations bounds for centralized $\chi^2$ are taken from Laurent and Massart [21].
Given a centralized $\chi^2$-variate $Z$ with $m$ degrees of freedom, then for all $x \ge 0$,

$$\mathbb{P}\big[Z - m \ge 2\sqrt{mx} + 2x\big] \;\le\; \exp(-x), \quad \text{and} \tag{67a}$$
$$\mathbb{P}\big[Z - m \le -2\sqrt{mx}\big] \;\le\; \exp(-x). \tag{67b}$$

The following consequence of this bound is useful: for $t \ge 1$, we have

$$\mathbb{P}\Big[\frac{Z - m}{m} \ge 4t\Big] \;\le\; \exp(-mt). \tag{68}$$

Starting with the bound (67a), setting $x = tm$ yields $\mathbb{P}\big[\frac{Z - m}{m} \ge 2\sqrt{t} + 2t\big] \le \exp(-tm)$. Since
$4t \ge 2\sqrt{t} + 2t$ for $t \ge 1$, we have $\mathbb{P}\big[\frac{Z - m}{m} \ge 4t\big] \le \exp(-tm)$ for all $t \ge 1$.
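The step from (67a) to (68) relies only on the elementary inequality $4t \ge 2\sqrt{t} + 2t$ for $t \ge 1$. The sketch below checks that inequality on a grid and, as an informal illustration, compares a Monte Carlo estimate of the tail probability with $\exp(-mt)$. The function names, the seed, the number of trials, and the choice $m = 4$, $t = 1$ are all assumptions; the simulation illustrates the bound and does not prove it.

```python
import math
import random

def tail_gap(t):
    """4t - (2*sqrt(t) + 2t); non-negative exactly when sqrt(t) >= 1."""
    return 4 * t - (2 * math.sqrt(t) + 2 * t)

def empirical_tail(m, t, trials, seed=0):
    """Monte Carlo estimate of P[(Z - m)/m >= 4t] for Z chi-square with m degrees of freedom."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))  # sum of m squared N(0,1) draws
        if (z - m) / m >= 4 * t:
            hits += 1
    return hits / trials

print(all(tail_gap(1 + 0.1 * k) >= 0 for k in range(100)))
print(empirical_tail(m=4, t=1.0, trials=20000) <= math.exp(-4 * 1.0))
```

In this example the estimated tail probability sits well below $\exp(-mt)$, consistent with (68) being a convenient rather than tight bound.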